5 simple examples to understand Elasticsearch aggregation

Elasticsearch aggregation give us the ability to ask questions to our data. The ability to group and find out statistics (such as sum, average, min, max) on our data by using a simple search query. In this post, we will see some very simple examples to understand how powerful and easy it is to use Elasticsearch aggregation. I will also share a postman collection link at the bottom of this post in case you want to try out these queries on your own.

Let’s say we have a car store and to formulate some reports we are interested in the following:

  • What is the average price of sold cars having manufacturer Audi ?
  • Find all cars made by Ford and average price of ford cars sold in Jul 2020
  • What is the total price of all cars sold in Jul 2020 ?
  • Which are the most popular car manufacturers?
  • How much sales were made each month ?

So let’s get started right away and look at our sample data on which we will be performing aggregation.

{
  "_index": "cars",
  "_type": "_doc",
  "_id": "1",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Audi",
    "model": "A6",
    "price": 3900000,
    "sold_date": "2020-03-10"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "2",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Ford",
    "model": "Fiesta",
    "price": 580000,
    "sold_date": "2020-07-18"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "3",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Audi",
    "model": "A7",
    "price": 6500000,
    "sold_date": "2020-05-28"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "4",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Audi",
    "model": "A8",
    "price": 14900000,
    "sold_date": "2020-06-10"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "5",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Ford",
    "model": "Linea",
    "price": 420000,
    "sold_date": "2020-05-26"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "6",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Ford",
    "model": "Figo",
    "price": 480000,
    "sold_date": "2020-07-13"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "7",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Maruti",
    "model": "Swift",
    "price": 680000,
    "sold_date": "2020-05-25"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "8",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Tata",
    "model": "Altroz",
    "price": 680000,
    "sold_date": "2020-03-25"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "9",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Tata",
    "model": "Tigor",
    "price": 520000,
    "sold_date": "2020-07-25"
  }
}

What is the average price of sold cars having manufacturer Audi ?

From our sample data let’s find this manually

SnoCarPrice
1.Audi A63900000
2.Audi A76500000
3. Audi A814900000
Average 8433333.33

Now in order to find it using aggregation, we have to use the following query:

{
    "size": 0,
    "query" : {
      "match": {
        "manufacturer": "audi"
      }
    }, 
    "aggs": {
      "average_price": {
        "avg": {
          "field": "price"
        }
      }
    }
}

The query is very simple, we are just asking elasticsearch to first filter the records which have manufacturer Audi and then on all the records in the result do an average aggregation on the field price. Now lets see the response from elasticsearch:

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "average_price": {
      "value": 8433333.333333334
    }
  }
}

Wow! here we go, with such a simple query we are able to find the correct result. Let’s spice up things now and move to the next one.

Find all cars made by Ford and average price of ford cars sold in Jul 2020

Now this is an interesting one, we want to see all the cars which have manufacturer Ford, but we need average price of only those which are sold in month of Jul 2020.

All cars manufactured by Ford

S.NoCarPriceSold On
1Ford Fiesta58000018 Jul 2020
2Ford Linea42000026 May 2020
3Ford Figo48000013 Jul 2020

So there are total 3 cars made by Ford and the average price of cars sold in Jul 2020 is (580000 + 480000) / 2 = 530000

Now lets see what elasticsearch query we can use to get this result:

{
  "query": {
    "match": {
      "manufacturer": "ford"
    }
  },
  "aggs": {
    "recent_sales": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "2020-07-01", 
            "lte": "2020-07-31"
          }
        }
      },
      "aggs": {
        "average_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

If you compare this query with the first one, the only difference is that we have added one extra date filter inside the aggs block. This is how we can filter results before performing aggregation on them.

Also if you look carefully, there is one more difference i.e in the first query we have used a

"size": 0

parameter.

There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by setting size=0

You can verify this by looking at the response of the first query, in that only the aggregation result was returned and we do not see the actual documents which are used to evaluate that result. Now let’s see the result of our second query:

{
  "took": 61,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1.1451323,
    "hits": [
      {
        "_index": "cars",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.1451323,
        "_source": {
          "manufacturer": "Ford",
          "model": "Fiesta",
          "price": 580000,
          "sold_date": "2020-07-18"
        }
      },
      {
        "_index": "cars",
        "_type": "_doc",
        "_id": "5",
        "_score": 1.1451323,
        "_source": {
          "manufacturer": "Ford",
          "model": "Linea",
          "price": 420000,
          "sold_date": "2020-05-26"
        }
      },
      {
        "_index": "cars",
        "_type": "_doc",
        "_id": "6",
        "_score": 1.1451323,
        "_source": {
          "manufacturer": "Ford",
          "model": "Figo",
          "price": 480000,
          "sold_date": "2020-07-13"
        }
      }
    ]
  },
  "aggregations": {
    "recent_sales": {
      "doc_count": 2,
      "average_price": {
        "value": 530000.0
      }
    }
  }
}

Wohoo! The results are accurate again. Also all the cars made by Ford are returned in the response. Let’s move to the next one now.

What is the total price of all cars sold in Jul 2020 ?

Again, let’s solve this manually first:

S.NoCarSold DatePrice
1Ford Fiesta18 Jul 2020580000
2Ford Figo13 Jul 2020480000
3Tata Tigor25 Jul 2020520000
Total1580000

Let’s see what query we can use to solve this:

{
    "size": 0,
    "query" : {
      "range": {
        "sold_date": {
          "gte": "2020-07-01",
          "lte": "2020-07-31"
        }
      }
    },
    "aggs": {
      "total_price": {
        "sum": {
          "field": "price"
        }
      }
    }
}

Cool, a simple one. We simply apply a query range to filter out all cars sold in month of Jul 2020 and then we do a sum aggregation on them to find out the result. Let’s check the response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "total_price": {
      "value": 1580000.0
    }
  }
}

Awesome, its correct.

Which are the most popular car manufacturers?

For the sake of this article, let’s say the manufacturers which have sold most cars in last 3 months are considered to be popular.

S.NoCarTotal sold in last 3 months
1Ford3
2Audi2
3Maruti1
4Tata1

Now let’s see what query we can use to find out this result:

{
  "size": 0,
  "query": {
    "range": {
      "sold_date": {
        "from": "now-3M"
      }
    }
  },
  "aggs": {
    "group_by_make": {
      "terms": {
        "field": "manufacturer.keyword"
      }
    }
  }
}

Again the query is very simple, we first filter out all the cars which were sold in last 3 months and then we simply group them by their manufacturer. Cool, lets see the response now:

{
  "took": 86,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "group_by_make": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Ford",
          "doc_count": 3
        },
        {
          "key": "Audi",
          "doc_count": 2
        },
        {
          "key": "Maruti",
          "doc_count": 1
        },
        {
          "key": "Tata",
          "doc_count": 1
        }
      ]
    }
  }
}

Great, our query worked fine!

How much sales were made each month ?

This is a tricky one!

S.NoMonthTotal Sales
1Mar 20204580000
2Apr 20200
3May 20207600000
4Jun 202014900000
5Jul 20201580000

Using the below query:

{
  "size": 0,
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "MM-yyyy"
      },
      "aggs": {
        "monthly_sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

Don’t worry if the query looks complex. We are just doing nested aggregation. First we group all the data in monthly buckets and then on each bucket we perform a metric aggregation to sum the price. Think of this as a pipeline of aggregations.

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 9,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "sales_over_time": {
      "buckets": [
        {
          "key_as_string": "03-2020",
          "key": 1583020800000,
          "doc_count": 2,
          "monthly_sales": {
            "value": 4580000.0
          }
        },
        {
          "key_as_string": "04-2020",
          "key": 1585699200000,
          "doc_count": 0,
          "monthly_sales": {
            "value": 0.0
          }
        },
        {
          "key_as_string": "05-2020",
          "key": 1588291200000,
          "doc_count": 3,
          "monthly_sales": {
            "value": 7600000.0
          }
        },
        {
          "key_as_string": "06-2020",
          "key": 1590969600000,
          "doc_count": 1,
          "monthly_sales": {
            "value": 1.49E7
          }
        },
        {
          "key_as_string": "07-2020",
          "key": 1593561600000,
          "doc_count": 3,
          "monthly_sales": {
            "value": 1580000.0
          }
        }
      ]
    }
  }
}

Wow! that was accurate and definitely not very difficult to achieve.

All right that was it for this post. Hope you had fun and experienced power of Elasticsearch aggregations. If you enjoyed the post, please like and share it so that it also reaches other valuable readers. If you have any doubts or feedbacks, please scroll to the bottom and leave a comment. Again, thanks for reading 🙂

Bonus:

Postman collection link: https://www.postman.com/collections/201d2f5fea372d02fc55

Tagged : / / / /

Be careful while querying inner objects in elasticsearch

In elasticsearch we can store closely related entities within a single document. For example, we can store a blog post and all of its comments together, by passing an array of comments.

{
  "title": "Invest Money",
  "body": "Please start investing money as soon...",
  "tags": ["money", "invest"],
  "published_on": "18 Oct 2017",
  "comments": [
    {
      "name": "William",
      "age": 34,
      "rating": 8,
      "comment": "Nice article..",
      "commented_on": "30 Nov 2017"
    },
    {
      "name": "John",
      "age": 38,
      "rating": 9,
      "comment": "I started investing after reading this.",
      "commented_on": "25 Nov 2017"
    },
    {
      "name": "Smith",
      "age": 33,
      "rating": 7,
      "comment": "Very good post",
      "commented_on": "20 Nov 2017"
    }
  ]
}

So we have an elasticsearch document describing a post and an inner object comments containing all the comments on a post. But inner objects in elasticsearch do not work as we expect. How ? We will see it soon.

PROBLEM

Now suppose we want to find all blog posts on which user {name: john, age: 34} has commented. So lets again look at our sample document above and find the users who had commented.

nameage
William34
John38
Smith33

From the list we can clearly see that there is no user John of 34 years age. For simplicity consider we have only 1 document in elasticsearch index. Lets verify the same by querying the index:

curl -XGET 'localhost:9200/blog/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "comments.name": "John" }},
        { "match": { "comments.age":  34 }}
      ]
    }
  }
}

Our sample document is returned in response. Surprised ?. Now that is why I said:

inner objects in elasticsearch do not work as expected

The problem here is that the library used by elasticsearch(lucene) has no concept of inner objects, so as a result inner objects are flattened into a simple list of field name and values. Our document is internally stored as:

{
  "title":                    [ invest, money ],
  "body":                     [ as, investing, money, please, soon, start ],
  "tags":                     [ invest, money ],
  "published_on":             [ 18 Oct 2017 ]
  "comments.name":            [ smith, john, william ],
  "comments.comment":         [ after, article, good, i, investing, nice, post, reading, started, this, very ],
  "comments.age":             [ 33, 34, 38 ],
  "comments.rating":          [ 7, 8, 9 ],
  "comments.commented_on":    [ 20 Nov 2017, 25 Nov 2017, 30 Nov 2017 ]
}

As you can clearly see above that the relationship between comments.name and comments.age has been lost. So that is why our document matches a query for john and 34.

SOLUTION

To solve this problem we just need to make a small change in mapping of elasticsearch. If you have a look at the mapping of index you will find that the type of comments field is object. We need to update it to type nested.

We can simply update the mapping of our index by running the below query:

curl -XPUT 'localhost:9200/blog' -d'
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { "type": "string" },
        "body": { "type": "string" },
        "tags": { "type": "text" },
        "published_on": { "type": "text" },
        "comments": {
          "type": "nested",
          "properties": {
            "name":    { "type": "string"  },
            "comment": { "type": "string"  },
            "age":     { "type": "short"   },
            "rating":   { "type": "short"   },
            "commented_on":    { "type": "text"    }
          }
        }
      }
    }
  }
}

After changing the mapping to type nested, there is a slight change in the way we can query the index. We need to use nested query. Given below is the nested query example:

curl -XGET 'localhost:9200/blog/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "comments",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "comments.name": "john"
                    }
                  },
                  {
                    "match": {
                      "comments.age": 34
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

The above query will return no document in response as there is no match of user {name: john, age: 34}.

Surprised again ? Just a small change solved a problem in no time. It may be a smaller change from our side, but a lot has changed in the way elasticsearch stores our document. Internally, nested objects index each object in the array as a separate hidden document, meaning that each nested object can be queried independently of the others.

Given below is the internal representation of sample document after changing mapping:

{
  {
    "comments.name":    [ john ],
    "comments.comment": [ after i investing started reading this ],
    "comments.age":     [ 38 ],
    "comments.rating":  [ 9 ],
    "comments.date":    [ 25 Nov 2017 ]
  },
  {
    "comments.name":    [ william ],
    "comments.comment": [ article, nice ],
    "comments.age":     [ 34 ],
    "comments.rating":   [ 8 ],
    "comments.date":    [ 30 Nov 2017 ]
  },
  {
    "comments.name":    [ smith ],
    "comments.comment": [ good, post, very],
    "comments.age":     [ 33 ],
    "comments.rating":   [ 7 ],
    "comments.date":    [ 20 Nov 2017 ]
  },
  {
    "title":            [ invest, money ],
    "body":             [ as, investing, money, please, soon, start ],
    "tags":             [ invest, money ],
    "published_on":     [ 18 Oct 2017 ]
  }
}

As you can see each inner object is stored as a separate hidden document internally. This maintains the relationship between their fields.

CONCLUSION:

So if you are using inner objects in index and querying them too, verify that the type of inner object is nested. Else the query may return invalid result documents.

Thanks for reading. Please like and share so that it can reach out to other valuable readers too.

Tagged : /
%d bloggers like this: