Author Avatar

Anuj Verma

0

Share post:

Elasticsearch aggregation give us the ability to ask questions to our data. The ability to group and find out statistics (such as sum, average, min, max) on our data by using a simple search query. In this post, we will see some very simple examples to understand how powerful and easy it is to use Elasticsearch aggregation. I will also share a postman collection link at the bottom of this post in case you want to try out these queries on your own.

Let’s say we have a car store and to formulate some reports we are interested in the following:

  • What is the average price of sold cars having manufacturer Audi ?
  • Find all cars made by Ford and average price of ford cars sold in Jul 2020
  • What is the total price of all cars sold in Jul 2020 ?
  • Which are the most popular car manufacturers?
  • How much sales were made each month ?

So let’s get started right away and look at our sample data on which we will be performing aggregation.

{
  "_index": "cars",
  "_type": "_doc",
  "_id": "1",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Audi",
    "model": "A6",
    "price": 3900000,
    "sold_date": "2020-03-10"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "2",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Ford",
    "model": "Fiesta",
    "price": 580000,
    "sold_date": "2020-07-18"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "3",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Audi",
    "model": "A7",
    "price": 6500000,
    "sold_date": "2020-05-28"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "4",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Audi",
    "model": "A8",
    "price": 14900000,
    "sold_date": "2020-06-10"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "5",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Ford",
    "model": "Linea",
    "price": 420000,
    "sold_date": "2020-05-26"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "6",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Ford",
    "model": "Figo",
    "price": 480000,
    "sold_date": "2020-07-13"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "7",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Maruti",
    "model": "Swift",
    "price": 680000,
    "sold_date": "2020-05-25"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "8",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Tata",
    "model": "Altroz",
    "price": 680000,
    "sold_date": "2020-03-25"
  }
},
{
  "_index": "cars",
  "_type": "_doc",
  "_id": "9",
  "_score": 1.0,
  "_source": {
    "manufacturer": "Tata",
    "model": "Tigor",
    "price": 520000,
    "sold_date": "2020-07-25"
  }
}

What is the average price of sold cars having manufacturer Audi ?

From our sample data let’s find this manually

SnoCarPrice
1.Audi A63900000
2.Audi A76500000
3. Audi A814900000
Average 8433333.33

Now in order to find it using aggregation, we have to use the following query:

{
    "size": 0,
    "query" : {
      "match": {
        "manufacturer": "audi"
      }
    }, 
    "aggs": {
      "average_price": {
        "avg": {
          "field": "price"
        }
      }
    }
}

The query is very simple, we are just asking elasticsearch to first filter the records which have manufacturer Audi and then on all the records in the result do an average aggregation on the field price. Now lets see the response from elasticsearch:

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "average_price": {
      "value": 8433333.333333334
    }
  }
}

Wow! here we go, with such a simple query we are able to find the correct result. Let’s spice up things now and move to the next one.

Find all cars made by Ford and average price of ford cars sold in Jul 2020

Now this is an interesting one, we want to see all the cars which have manufacturer Ford, but we need average price of only those which are sold in month of Jul 2020.

All cars manufactured by Ford

S.NoCarPriceSold On
1Ford Fiesta58000018 Jul 2020
2Ford Linea42000026 May 2020
3Ford Figo48000013 Jul 2020

So there are total 3 cars made by Ford and the average price of cars sold in Jul 2020 is (580000 + 480000) / 2 = 530000

Now lets see what elasticsearch query we can use to get this result:

{
  "query": {
    "match": {
      "manufacturer": "ford"
    }
  },
  "aggs": {
    "recent_sales": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "2020-07-01", 
            "lte": "2020-07-31"
          }
        }
      },
      "aggs": {
        "average_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

If you compare this query with the first one, the only difference is that we have added one extra date filter inside the aggs block. This is how we can filter results before performing aggregation on them.

Also if you look carefully, there is one more difference i.e in the first query we have used a

"size": 0

parameter.

There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by setting size=0

You can verify this by looking at the response of the first query, in that only the aggregation result was returned and we do not see the actual documents which are used to evaluate that result. Now let’s see the result of our second query:

{
  "took": 61,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1.1451323,
    "hits": [
      {
        "_index": "cars",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.1451323,
        "_source": {
          "manufacturer": "Ford",
          "model": "Fiesta",
          "price": 580000,
          "sold_date": "2020-07-18"
        }
      },
      {
        "_index": "cars",
        "_type": "_doc",
        "_id": "5",
        "_score": 1.1451323,
        "_source": {
          "manufacturer": "Ford",
          "model": "Linea",
          "price": 420000,
          "sold_date": "2020-05-26"
        }
      },
      {
        "_index": "cars",
        "_type": "_doc",
        "_id": "6",
        "_score": 1.1451323,
        "_source": {
          "manufacturer": "Ford",
          "model": "Figo",
          "price": 480000,
          "sold_date": "2020-07-13"
        }
      }
    ]
  },
  "aggregations": {
    "recent_sales": {
      "doc_count": 2,
      "average_price": {
        "value": 530000.0
      }
    }
  }
}

Wohoo! The results are accurate again. Also all the cars made by Ford are returned in the response. Let’s move to the next one now.

What is the total price of all cars sold in Jul 2020 ?

Again, let’s solve this manually first:

S.NoCarSold DatePrice
1Ford Fiesta18 Jul 2020580000
2Ford Figo13 Jul 2020480000
3Tata Tigor25 Jul 2020520000
Total1580000

Let’s see what query we can use to solve this:

{
    "size": 0,
    "query" : {
      "range": {
        "sold_date": {
          "gte": "2020-07-01",
          "lte": "2020-07-31"
        }
      }
    },
    "aggs": {
      "total_price": {
        "sum": {
          "field": "price"
        }
      }
    }
}

Cool, a simple one. We simply apply a query range to filter out all cars sold in month of Jul 2020 and then we do a sum aggregation on them to find out the result. Let’s check the response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "total_price": {
      "value": 1580000.0
    }
  }
}

Awesome, its correct.

Which are the most popular car manufacturers?

For the sake of this article, let’s say the manufacturers which have sold most cars in last 3 months are considered to be popular.

S.NoCarTotal sold in last 3 months
1Ford3
2Audi2
3Maruti1
4Tata1

Now let’s see what query we can use to find out this result:

{
  "size": 0,
  "query": {
    "range": {
      "sold_date": {
        "from": "now-3M"
      }
    }
  },
  "aggs": {
    "group_by_make": {
      "terms": {
        "field": "manufacturer.keyword"
      }
    }
  }
}

Again the query is very simple, we first filter out all the cars which were sold in last 3 months and then we simply group them by their manufacturer. Cool, lets see the response now:

{
  "took": 86,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "group_by_make": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Ford",
          "doc_count": 3
        },
        {
          "key": "Audi",
          "doc_count": 2
        },
        {
          "key": "Maruti",
          "doc_count": 1
        },
        {
          "key": "Tata",
          "doc_count": 1
        }
      ]
    }
  }
}

Great, our query worked fine!

How much sales were made each month ?

This is a tricky one!

S.NoMonthTotal Sales
1Mar 20204580000
2Apr 20200
3May 20207600000
4Jun 202014900000
5Jul 20201580000

Using the below query:

{
  "size": 0,
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "MM-yyyy"
      },
      "aggs": {
        "monthly_sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

Don’t worry if the query looks complex. We are just doing nested aggregation. First we group all the data in monthly buckets and then on each bucket we perform a metric aggregation to sum the price. Think of this as a pipeline of aggregations.

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 9,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "sales_over_time": {
      "buckets": [
        {
          "key_as_string": "03-2020",
          "key": 1583020800000,
          "doc_count": 2,
          "monthly_sales": {
            "value": 4580000.0
          }
        },
        {
          "key_as_string": "04-2020",
          "key": 1585699200000,
          "doc_count": 0,
          "monthly_sales": {
            "value": 0.0
          }
        },
        {
          "key_as_string": "05-2020",
          "key": 1588291200000,
          "doc_count": 3,
          "monthly_sales": {
            "value": 7600000.0
          }
        },
        {
          "key_as_string": "06-2020",
          "key": 1590969600000,
          "doc_count": 1,
          "monthly_sales": {
            "value": 1.49E7
          }
        },
        {
          "key_as_string": "07-2020",
          "key": 1593561600000,
          "doc_count": 3,
          "monthly_sales": {
            "value": 1580000.0
          }
        }
      ]
    }
  }
}

Wow! that was accurate and definitely not very difficult to achieve.

All right that was it for this post. Hope you had fun and experienced power of Elasticsearch aggregations. If you enjoyed the post, please like and share it so that it also reaches other valuable readers. If you have any doubts or feedbacks, please scroll to the bottom and leave a comment. Again, thanks for reading 🙂

Bonus:

Postman collection link: https://www.postman.com/collections/201d2f5fea372d02fc55

Simple and elegant Vim IDE setup for Go
10 must know Vim commands for beginners

Leave a Reply