Elasticsearch: Getting a List of Distinct Values

I started using Elasticsearch a little more than a year ago. Elasticsearch is a distributed, open source search and analytics engine, designed for horizontal scalability, reliability, and easy management[1]. It has a very good, easy-to-use RESTful API, so you can use it with any web client. From Java, however, I prefer to use the dedicated Java API. Recently I was working on indexing e-mails with a special focus on the labels associated with the messages. Although the Elastic API is pretty well documented, it took me a good amount of googling to figure out how to get back a list of unique labels. In Elasticsearch, documents (in my case the e-mails) are organized into indices, and to get back a list of distinct values of a property (the e-mail labels) we need to use the so-called aggregation API. Here’s my solution:

import static java.util.stream.Collectors.toList;

import java.util.List;
import java.util.concurrent.ExecutionException;

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.StringTerms;

public class AggregationRepository {
  private Client client;

  public AggregationRepository(Client client) {
    super();
    this.client = client;
  }

  public List<String> getDistinctLabels()
    throws InterruptedException, ExecutionException {
    SearchRequestBuilder aggregationQuery = 
      client.prepareSearch("emails")
        .setQuery(QueryBuilders.matchAllQuery())
        .addAggregation(AggregationBuilders.terms("label_agg")
          .field("labels").size(100));
    SearchResponse response = aggregationQuery.execute().get();
    Aggregation aggregation = response.getAggregations().get("label_agg");
    StringTerms st = (StringTerms) aggregation;
    return st.getBuckets().stream()
      .map(bucket -> bucket.getKeyAsString())
      .collect(toList());
  }
}


This is what the search query posted to the API looks like:

{
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "label_agg" : {
      "terms" : {
        "field" : "labels",
        "size" : 100
      }
    }
  }
}


And the response is…

{
  ...
  "aggregations": {
    "label_agg": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Gmail",
          "doc_count": 1
        },
        {
          "key": "Yahoo",
          "doc_count": 1
        }
      ]
    }
  }
}


Just to give a little bit of explanation: label_agg is the name of the aggregation – you can choose it however you wish. The results come back as buckets, and the bucket keys collected into a list give us the unique labels. Of course, the aggregation API is capable of much more. This was just a simple example of how to select distinct values from documents stored in Elasticsearch.
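One detail to keep in mind is the size parameter: the terms aggregation returns at most that many buckets, so with more than 100 distinct labels the list would be cut short. The sum_other_doc_count field in the response counts the documents that fell into buckets which were not returned. Here is a minimal sketch of that check in plain Java. The LabelBuckets and Bucket classes are hypothetical stand-ins that mirror the JSON response above. They are not types from the Elasticsearch client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins mirroring the JSON response; not Elasticsearch client types.
class LabelBuckets {

  static final class Bucket {
    final String key;
    final long docCount;

    Bucket(String key, long docCount) {
      this.key = key;
      this.docCount = docCount;
    }
  }

  // Collects the bucket keys, which are the distinct values of the field.
  static List<String> distinctKeys(List<Bucket> buckets) {
    List<String> keys = new ArrayList<>();
    for (Bucket bucket : buckets) {
      keys.add(bucket.key);
    }
    return keys;
  }

  // A non-zero sum_other_doc_count means some documents landed in buckets
  // that were cut off by the "size" limit, so the key list is incomplete.
  static boolean isComplete(long sumOtherDocCount) {
    return sumOtherDocCount == 0;
  }

  public static void main(String[] args) {
    List<Bucket> buckets = Arrays.asList(new Bucket("Gmail", 1), new Bucket("Yahoo", 1));
    System.out.println(distinctKeys(buckets));
    System.out.println(isComplete(0));
  }
}
```

If the check fails, the simplest remedy is probably to raise size and run the query again.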

Resources

  1. Powering Data Search, Log Analysis, Analytics https://www.elastic.co/products

Crossing the Chasm by Geoffrey A. Moore

“First, there is a mountain, Then there is no mountain, Then there is.”

Is this some kind of magic trick? Nope, this is how business-to-business marketing works in the high-tech industry. And this is what the book Crossing the Chasm is about: marketing and selling technology products to mainstream customers.

Before you stop reading because “marketing is not for me”, I promise it’s going to be interesting, so bear with me a little longer. In fact, I think every developer should read this book to get a wider perspective on what happens to the code we write: how it becomes a real product, and how marketing decisions can drive a product with superior technology into the ground while a seemingly inferior product triumphs.

Crossing the Chasm book cover

Disruptive Innovation

First, we have to understand the distinction between continuous and discontinuous, or disruptive, innovation. BMW offering a faster, more dynamic gasoline car with a slightly better design is an example of continuous innovation. In contrast, all-electric cars from Tesla represent a discontinuous innovation because they demand significant changes not only from the customer but also from the supporting infrastructure. We cannot just stop, fill up our electric car and be on our way in a matter of minutes – or at least not for now. We need to stop for about 30 minutes, and maybe have a bite while the car charges. In exchange, we get a cleaner, more sustainable and more efficient form of transportation.

The Technology Adoption Life Cycle

“The Technology Adoption Life Cycle model describes the market penetration of any new technology product in terms of a progression in the types of consumers it attracts throughout its useful life. The groups are distinguished from each other in terms of how they react to disruptive innovation.”

Innovators or technology enthusiasts pursue technology products simply for the pleasure of exploring the new capabilities they provide. Early adopters or visionaries appreciate the benefits of a new technology. They typically have a greater vision and they see how a new technology product could support their goals. The early majority shares the early adopters’ ability to relate to technology, but they are driven by practicality. They want to see how other people are making out before they invest substantially. The late majority is even more conservative: they wait until the new technology becomes a standard. Laggards are people who only buy the technology when it becomes a necessity or when it’s so deeply integrated with something existing that they don’t even know they bought it. Each group has a unique psychographic profile – price sensitivity, expectations, and priorities. Therefore, selling products to customers in each group requires a fundamentally different approach and marketing communication.

Discovering the Chasm

As we work our way through the technology adoption life cycle, we discover that the psychographic groups are divided from each other by gaps. “This symbolizes … the difficulty any group will have in accepting a new product if it is presented in the same way as it was to the group to its immediate left”. The biggest disconnect is between the early adopters and the early majority – this is the chasm that many start-up ventures have fallen into. And because there are so many business customers on the right side of it, crossing the chasm is fundamental to making any significant profits.

Visionaries, the early adopters, expect a radical discontinuity between the old ways and the new, and they are prepared to fight the resistance. They are prepared to bear with the inevitable bugs and glitches of the new product. By contrast, what the early majority wants is a productivity improvement for existing operations. They do not want to debug someone else’s product. They are looking for a solution integrated with current systems. Because of these incompatibilities, visionaries do not make good references for the early majority. The only suitable reference, it turns out, is another member of the same group. This leads us to a catch-22 situation which needs to be solved, because references play a crucial role in the buying decisions of early majority customers.

Crossing the Chasm

To cross the chasm, high-technology companies need to launch a D-Day type of invasion in which they first focus on a single niche market segment to secure the beachhead. All development and marketing efforts should be focused on a single target. This is the only way to cross the channel … sorry, the chasm. Once the beachhead is secured you can advance to adjacent market segments until you become the market leader. But how to choose which beachhead to attack? You will need to select a market segment with an acute problem and offer a solution that current mainstream products cannot provide. Mainstream vendors neglect such a segment, perhaps because they can make a tremendous amount of profit with a more general product.

Let me illustrate this with an example from the book. When pharmaceutical companies introduce a new drug, they have to deal with a very complicated documentation process. There were often delays just because tracking and versioning the documentation was a nightmare. Then came a company called Documentum that solved this literally million-dollars-a-day problem. Guess what: effective document handling is also important in a number of other niches, like finance, human resources, etc. The trick is that once you dominate a segment, its early majority customers serve as a very good reference for the same type of customers in adjacent segments.

Creating the Competition

When choosing the point of attack, try to select a small pond where you can be the big fish. Of course, the pond should be big enough to keep you alive. And in any big enough pond there will be competition. But competition is good; in fact, pragmatists are reluctant to buy until they can compare. You will need two companies to put your product on the map, much like GPS satellites tell your position on the globe. One of them is the market alternative, with the established customer base you are after. This could be, for example, Microsoft SharePoint in document sharing. The other is the product alternative, which harnesses the same disruptive innovation as your solution. This could be something like Dropbox with its ease of use and polished user experience. By pointing out that there is a product alternative, you also weaken the position of the market alternative company. This is done by giving customers the notion that there is an ongoing paradigm shift in the marketplace. Your intent should be to acknowledge this new technology but to differentiate yourself from the product alternative by virtue of your own segment focus.

Leaving the Chasm Behind

There is a sad revelation to make once you have successfully crossed the chasm. People who made this breakthrough possible most probably won’t carry you forward. Rockstar developers will want to work on the “next big thing” and salespeople who are able to sell to visionaries might not be as successful with conservatives. This transition won’t happen overnight but it will happen almost inevitably.

Personal Notes

I found two things very interesting in the book. The first is that the editions that followed the original were not written because market dynamics had changed significantly. Markets work nearly the same way today as they did in 1991, when the book was first released. The reason for the new editions was that companies come and go, so the examples in the book had to be updated. Who has heard of Lotus or Silicon Graphics? Will Facebook still be known in 10, 20 or 25 years’ time? Maybe, maybe not. It always fascinates me how fast-paced our industry is.

The second thing is that the D-Day type of strategy might work within a single enterprise just as well as in the wider market. I see a fractal pattern where departments inside a firm behave as adjacent market segments at a smaller scale. Departments working with less business-critical applications will adopt a new technology first. Once it’s established in at least one department, it will be easier to sell the technology to other department heads.

In closing, I would like to recommend this book to developers and owners of small technology companies. A strategy that has proved itself many times with big companies can be applied on a smaller scale as well. It might not be the same league but it surely is the same game!

5 Tricky Questions from Java

While I was studying for the Java 8 OCA exam, I came across some questions that really confused me the first time I saw them. Here is a short list of five tricky questions you might expect on an exam or during a job interview.

The Assignment Operator

What will happen when you try to compile and run the following pieces of code independently?

Snippet #1

int i = 1;
i = i + 2.5;
System.out.println(i);

Snippet #2

int i = 1;
i += 2.5;
System.out.println(i);

  A. Both result in a compile-time error
  B. Both compile successfully and print 3
  C. Both compile successfully and print 3.5
  D. The first snippet fails to compile; the second one compiles and prints 3

Answer

The first snippet fails to compile because the expression i + 2.5 evaluates to a double, which cannot be assigned to an int variable without an explicit cast. You might think that the second snippet does the same thing, but that’s not the case. The compound assignment operator += does not just combine addition and assignment in order to look cooler; it also implicitly casts the result of the operation back to the type of the left-hand variable – in our case to int. So, option D is correct. Please note that 3 is printed instead of 3.5 because the cast to int truncates the fractional part.
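To make the hidden cast visible, here is a small sketch that puts the compound form next to its hand-written equivalent. The class and method names are made up for this illustration.

```java
// Sketch: the compound assignment from Snippet #2 next to its long form.
class CompoundAssignmentDemo {

  // Compiles: += implicitly narrows the double result back to int.
  static int compound(int i) {
    i += 2.5;
    return i;
  }

  // The equivalent plain assignment needs the cast spelled out.
  static int explicitCast(int i) {
    i = (int) (i + 2.5);
    return i;
  }

  public static void main(String[] args) {
    System.out.println(compound(1));      // prints 3
    System.out.println(explicitCast(1));  // prints 3
  }
}
```

Removing the (int) cast from explicitCast brings back the compile error from Snippet #1.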

Indexing Arrays

Consider the following class:

class Test{
   public static void main(String[ ] args){
      int[] x = { 1, 2, 3, 4};
      int[] y = { 0, 1, 3};
      System.out.println( x [ (x = y)[2] ] );
   }
}

What will it print when compiled and run?

  A. It will throw ArrayIndexOutOfBoundsException
  B. It will print 4

Answer

When indexing arrays, the expression to the left of the brackets is fully evaluated before any part of the expression within the brackets is evaluated. This means that the original value of x is fetched and remembered while the expression (x = y)[2] is evaluated. That expression assigns y to x and yields y[2], which is 3, so the element at index 3 of the original array is read. So, option B is correct: the value 4 is printed from the first array.
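The evaluation order is easier to follow when the one-liner is split into separate steps. This is only a sketch of what happens conceptually. The variable names original and index are mine.

```java
// Sketch: the array access x[(x = y)[2]] unrolled into explicit steps.
class ArrayIndexOrderDemo {

  static int evaluate() {
    int[] x = {1, 2, 3, 4};
    int[] y = {0, 1, 3};

    int[] original = x;      // step 1: the reference left of the brackets is fetched
    int index = (x = y)[2];  // step 2: x now refers to y, and y[2] is 3
    return original[index];  // step 3: index 3 of the original array, which holds 4
  }

  public static void main(String[] args) {
    System.out.println(evaluate()); // prints 4
  }
}
```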

Null Reference

What do you think will be the result of trying to compile and run the following code?

class A {
     public static void print(Object obj) {
          System.out.println(obj);
     }
     public static void main(String[] args) {
          A a = null;
          a.print(a);
     }
}

  A. Compile error
  B. NullPointerException is thrown at runtime
  C. Compiles and runs without an issue and prints null

Answer

Option C is correct. Please note that the print method in class A is static. It doesn’t matter that variable a holds null: a static method is resolved at compile time based on the declared type of the variable, so the reference is never dereferenced.
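For contrast, the sketch below calls a static and an instance method through the same null reference. The class and its methods are invented for the example.

```java
// Sketch: static vs. instance method calls through a null reference.
class NullReceiverDemo {

  static String staticName() {
    return "static";
  }

  String instanceName() {
    return "instance";
  }

  // The static call is bound to the declared type, so null is never dereferenced.
  static String callStaticOnNull() {
    NullReceiverDemo demo = null;
    return demo.staticName();
  }

  // The instance call needs a real object and fails with NullPointerException.
  static boolean instanceCallThrows() {
    NullReceiverDemo demo = null;
    try {
      demo.instanceName();
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(callStaticOnNull());   // prints static
    System.out.println(instanceCallThrows()); // prints true
  }
}
```

In real code it is clearer to write NullReceiverDemo.staticName(), which makes it obvious that no instance is involved.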

Date Manipulation

What will be printed out by running the following code?

LocalDate date = LocalDate.of(2016, Month.AUGUST, 20); 
Period period = Period.ofMonths(1).ofDays(1); 
LocalDate past = date.minus(period); 
System.out.println(past.format(DateTimeFormatter.ISO_DATE));

  A. 19 Aug 2016
  B. 2016-08-19
  C. 2016-07-19

Answer

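The trick here is that Period.ofMonths(int) and Period.ofDays(int) are static factory methods, so chaining them does not combine their results. The code compiles, but ofDays(1) is resolved against the Period class and returns a brand new one-day period, discarding the one-month period created by ofMonths(1). So, the period subtracted from the date represents only one day without the month. The correct answer is option B. Please notice the ISO_DATE format too. Option A would be correct if we used DateTimeFormatter.ofPattern("dd MMM yyyy") with an English locale.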
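If the intention really was to subtract one month and one day, an instance method such as plusDays can be chained instead. The sketch below compares both variants. The class and method names are my own.

```java
import java.time.LocalDate;
import java.time.Month;
import java.time.Period;
import java.time.format.DateTimeFormatter;

// Sketch: chaining static factories vs. chaining an instance method on Period.
class PeriodChainingDemo {

  static final LocalDate DATE = LocalDate.of(2016, Month.AUGUST, 20);

  // As in the question: the chained static call is just Period.ofDays(1).
  static String chainedStaticCalls() {
    Period period = Period.ofMonths(1).ofDays(1);
    return DATE.minus(period).format(DateTimeFormatter.ISO_DATE);
  }

  // plusDays is an instance method, so the one-month part is preserved.
  static String monthAndDay() {
    Period period = Period.ofMonths(1).plusDays(1);
    return DATE.minus(period).format(DateTimeFormatter.ISO_DATE);
  }

  public static void main(String[] args) {
    System.out.println(chainedStaticCalls()); // prints 2016-08-19
    System.out.println(monthAndDay());        // prints 2016-07-19
  }
}
```

Period.of(0, 1, 1) would build the same one-month-one-day period in a single call.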

String Pool

What will be the result of attempting to compile and run the following code?

if ("true".replace('T', 't') == "true")
     System.out.println("true");
else
     System.out.println("false");

  A. It will print “true”
  B. It will print “false”

Answer

When comparing two strings with the == operator, we ask whether they point to the same object. The two "true" literals on the first line represent the same object from the string pool. When we call the replace method on the first "true", it returns a String object in which all occurrences of the first parameter are replaced with the second parameter. However, if there is no change to be made, a reference to the same object is returned; therefore option A is correct – the code above prints "true". If we changed the first line to if ("true".replace('t', 'T') == "True"), the code would print "false" because the "True" returned by the replace method would be a new String object.
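Both cases from the explanation can be checked side by side. The sketch below also adds equals, which compares contents and is what you would normally use outside of quiz questions. The class and method names are invented for the example.

```java
// Sketch: reference comparison after String.replace, with and without a change.
class StringReplaceIdentityDemo {

  // No 'T' occurs in "true", so replace returns the very same pooled object.
  static boolean sameObjectWhenNothingReplaced() {
    return "true".replace('T', 't') == "true";
  }

  // A replacement happens, so a new String object is created at runtime.
  static boolean sameObjectWhenReplaced() {
    return "true".replace('t', 'T') == "True";
  }

  // equals compares the characters, not the references.
  static boolean equalContentWhenReplaced() {
    return "true".replace('t', 'T').equals("True");
  }

  public static void main(String[] args) {
    System.out.println(sameObjectWhenNothingReplaced()); // prints true
    System.out.println(sameObjectWhenReplaced());        // prints false
    System.out.println(equalContentWhenReplaced());      // prints true
  }
}
```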

These were my favorite tricky Java questions. Of course, there are tons of questions that can be asked. The 5 questions above are really just the tip of the iceberg. Anyway, I hope this helps you regardless of whether you are preparing for a Java exam or an upcoming job interview, or you’re just a Java enthusiast like I am.