Pradeep Mishra

Introduction

Elasticsearch is the heart of today’s most popular log analytics platform, the ELK Stack (Elasticsearch, Logstash, and Kibana). Its role is so central that its name has become synonymous with the stack itself. Used primarily for search and log analysis, Elasticsearch ranked among the top 10 most popular databases in the Stack Overflow Developer Survey 2020.

Elasticsearch is a formidable competitor to Apache Solr, as well as commercial search engines such as Splunk and other log analytics engines.

When people ask, “what is Elasticsearch?”, some may answer that it’s “an index”, “a search engine”, an “analytics database”, “a big data solution”, that “it’s fast and scalable”, or that “it’s kind of like Google”. Depending on your level of familiarity with this technology, these answers may either bring you closer to an ah-ha moment or further confuse you. But the truth is, all of these answers are correct, and that’s part of the appeal of Elasticsearch. Over the years, Elasticsearch and the ecosystem of components that has grown around it, called the “Elastic Stack”, have been used for a growing number of use cases: simple search on a website or document, collecting and analyzing log data, and business intelligence work such as data analysis and visualization. So how did a simple search engine created by Elastic co-founder Shay Banon for his wife’s cooking recipes grow to become today’s most popular enterprise search engine and one of the 10 most popular DBMS? We’ll answer that in this post by understanding what Elasticsearch is, how it works, and why it is used.

What is Elasticsearch?

Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene and developed in Java. It started as a scalable version of the Lucene open-source search framework, adding the ability to scale Lucene indices horizontally. Elasticsearch allows you to store, search, and analyze huge volumes of data in near real-time and returns answers in milliseconds. It achieves fast search responses because, instead of searching the text directly, it searches an index. It uses a structure based on documents instead of tables and schemas and comes with extensive REST APIs for storing and searching the data. At its core, you can think of Elasticsearch as a server that can process JSON requests and give you back JSON data.
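To make the “JSON in, JSON out” idea concrete, here is a minimal sketch that builds, but does not send, a full-text search request using only the Python standard library. The host `http://localhost:9200`, the index name `recipes`, and the field `title` are assumptions chosen for illustration; the `_search` endpoint and the `match` query are part of Elasticsearch’s REST API.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) an HTTP request for a full-text match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index, and field names.
request = build_search_request(
    "http://localhost:9200", "recipes", "title", "tomato soup"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running cluster, and the response body would also be JSON.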

Apache Lucene is a free and open-source search engine software library, originally written completely in Java by Doug Cutting.

How does Elasticsearch work?

Elasticsearch works by retrieving and managing document-oriented and semi-structured data. Internally, the basic principle of how Elasticsearch works is the “shared nothing” architecture. The primary data structure Elasticsearch uses is an inverted index managed using Apache Lucene’s APIs.

In very simple terms, an inverted index is a mapping of each unique ‘word’ (token) to the list of documents (locations) containing that word, which makes it possible to locate documents with given keywords very quickly. Index information is stored in one or more partitions, also called shards. Elasticsearch is able to distribute and allocate shards dynamically to the nodes in a cluster, as well as replicate them.
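The token-to-documents mapping can be sketched in a few lines of Python. This is a toy illustration of the idea only: it splits on whitespace and lowercases, whereas Lucene applies configurable analyzers and stores far more than document IDs. The sample documents are made up.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    1: "Tomato soup with basil",
    2: "Creamy tomato pasta",
    3: "Basil pesto pasta",
}
index = build_inverted_index(docs)
print(sorted(search(index, "tomato")))       # [1, 2]
print(sorted(search(index, "basil pasta")))  # [3]
```

A lookup touches only the entries for the query tokens, so its cost depends on how many documents match rather than on how much text is stored.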

This mechanism makes it flexible with regard to data distribution. Redundancy can be provided by distributing replica shards (‘copies’ of the primary shards) to different cluster nodes. Index operations use primary shards and search queries use both shard types. Having multiple nodes and replicas can therefore increase query performance.

Understanding the Basic Concepts of Elasticsearch

Let’s take a look at the basic concepts of Elasticsearch: the JVM, indexes, shards, mappings, segments, documents, nodes, clusters, and replicas.

JVM

Elasticsearch is written in Java and thus uses the Java Virtual Machine (JVM). The JVM is a runtime engine that executes bytecode on many operating system platforms.

Index

An index is a collection of documents that often have a similar structure; it is where documents are stored and read from. It’s the equivalent of a database in an RDBMS (relational database management system). The index is identified by a unique index name that you refer to whenever you perform search, update, or delete actions.

Using an inverted index is a lot like searching for a book page that contains a certain keyword by scanning the index at the back of the book instead of scanning every page from beginning to end. This inverted index enables Elasticsearch to retrieve data quickly and efficiently.

In terms of data modeling, an index could be compared to a collection in MongoDB or CouchDB. A single index holds one type of data with its own data structure, and a cluster can contain more than one index. The schema is defined by the mapping. An index is built from 1-N primary shards, each of which can have 0-N replica shards.

Shard

A shard is a subset of the documents of an index. Shards matter when the volume of data stored in your cluster exceeds the limits of a single server: Elasticsearch lets you split your index into smaller pieces, called shards, that can live on different nodes. Each shard is a single Lucene index instance. Elasticsearch has two types of shards:

  • primary shards, or active shards that hold the data
  • replica shards, or copies of the primary shard
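How a document ends up in a particular primary shard can be sketched as hashing a routing value (by default the document ID) and taking the result modulo the number of primary shards, which is the scheme Elasticsearch’s documentation describes. The sketch below uses `zlib.crc32` as a stand-in hash; it is not the hash function Elasticsearch uses, and the document IDs are made up.

```python
import zlib


def pick_shard(routing_key, num_primary_shards):
    """Choose a primary shard for a routing key.

    crc32 is a stand-in chosen for this sketch, not Elasticsearch's hash.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


NUM_PRIMARY_SHARDS = 3  # assumed value for illustration
doc_ids = [f"doc-{n}" for n in range(10)]

shards = {shard: [] for shard in range(NUM_PRIMARY_SHARDS)}
for doc_id in doc_ids:
    shards[pick_shard(doc_id, NUM_PRIMARY_SHARDS)].append(doc_id)

for shard, ids in shards.items():
    print(shard, ids)
```

Because the shard number depends on the shard count, changing the number of primary shards would send existing IDs to different shards, which is one reason the primary shard count is normally fixed when an index is created.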

Mapping

A mapping is the schema definition for the index. Extending the mapping with new fields or adding sub-fields is possible at any time, but changing the type of an existing field is a more complex operation that requires re-indexing the data.

When no mapping is defined, Elasticsearch tries to detect the type of each field (for example string, number, boolean, or date) automatically. It creates an automatic mapping for the data type, sets default analyzers for strings, and adds a “keyword” sub-field (not analyzed). By default, a string is mapped as both text and a keyword sub-field, so you can do full-text search on one hand, and exact matches, sorting, and aggregations on the other.
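The default behaviour can be approximated with a short sketch. The function below is a hypothetical, simplified imitation of dynamic mapping for a flat document: it covers only booleans, integers, floats, and strings, and it ignores dates, objects, and arrays. The mapping it produces for strings (a `text` field with a `keyword` sub-field) follows the shape Elasticsearch generates by default.

```python
def infer_field_mapping(value):
    """Guess a field mapping from a Python value (simplified sketch)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_mapping(document):
    """Build a mapping body for a flat document."""
    return {
        "properties": {
            name: infer_field_mapping(value) for name, value in document.items()
        }
    }


# Hypothetical document.
mapping = infer_mapping({"title": "Tomato soup", "servings": 4, "vegan": True})
print(mapping["properties"]["title"])
print(mapping["properties"]["servings"])
```

With such a mapping, a full-text query would target `title`, while sorting or aggregating would use the `title.keyword` sub-field.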

It’s important to define the correct mapping to avoid problems at index and query time. For example, you want to avoid having Elasticsearch identify a field as a number and then later try to index a document whose value for that same field is a string. Indexing such a document will fail.
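The failure mode can be illustrated with a toy check. This is a stand-in written for this post, not Elasticsearch’s validation logic: it treats a mapping as a plain dictionary of field name to type and reports the fields whose values cannot be read as an integer. The field name `status_code` is hypothetical.

```python
def rejected_fields(mapping, document):
    """Return names of fields whose values do not fit a 'long' mapping."""
    rejected = []
    for name, value in document.items():
        if mapping.get(name) == "long":
            try:
                int(value)
            except (TypeError, ValueError):
                rejected.append(name)
    return rejected


mapping = {"status_code": "long"}  # field was first seen as a number
print(rejected_fields(mapping, {"status_code": 200}))          # []
print(rejected_fields(mapping, {"status_code": "not found"}))  # ['status_code']
```

Defining the mapping explicitly before indexing makes this kind of conflict a deliberate decision rather than an accident of which document arrived first.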

Segments

A segment is a Lucene-level concept: segments are the chunks that make up a shard (a Lucene index). Each Lucene index contains one or more segments. Although this is a Lucene-level detail, Elasticsearch does offer settings to manage segment sizes, and how you configure them affects Elasticsearch indexing performance.

Document 

A document is the basic unit of information in Elasticsearch and is represented in JSON (JavaScript Object Notation) format. Documents can be stored and indexed. An index has one or more documents and a document has one or more fields. In RDBMS terms, a document is a row. The original document is returned as “_source” in the API, alongside the indexed fields of the document.

Search is only possible against indexed fields, and retrieving the original content of an individual field is only possible for fields defined as “stored” in the mapping (aside from the “_source” object mentioned above, which holds the complete document values).

For efficient field-based display, the stored flag should be set when the “_source” objects are large, since fetching only the needed fields can reduce network traffic and speed up the display of results.

Node

A node is a single instance of the Elasticsearch process. It’s a server that stores data and takes part in the cluster’s indexing and searching functions. Nodes discover each other in the cluster by their shared cluster name.

A cluster can have multiple nodes. Depending on the node configuration, multicast or unicast discovery is used. Multiple nodes can run on a single physical server, VM, or container. The two main node types are data nodes and master nodes: a node can be configured to hold data, to act as a cluster master node, or both.

Cluster

A cluster consists of one or more nodes (servers) that together store all the data and provide indexing and searching capabilities across all nodes. Each cluster has a single active master node, which is elected automatically (e.g., when the current master node fails).

Replica

A replica is the mechanism Elasticsearch uses to survive failures, such as a node going offline, without losing data. It’s a copy of a primary shard and can be used for searches just like the primary shard.
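Why replicas protect against a node failure can be shown with a small simulation. The round-robin placement below is a simplification invented for this sketch, not Elasticsearch’s allocation algorithm; the one property it shares with the real thing is that a replica is never placed on the same node as its primary. Node names are hypothetical.

```python
def allocate(num_primaries, nodes):
    """Place each primary and one replica on different nodes, round-robin."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    placement = {}
    for shard in range(num_primaries):
        placement[shard] = {
            "primary": nodes[shard % len(nodes)],
            "replica": nodes[(shard + 1) % len(nodes)],
        }
    return placement


def surviving_shards(placement, failed_node):
    """Return the shards that still have at least one copy after a failure."""
    return {
        shard
        for shard, copies in placement.items()
        if any(node != failed_node for node in copies.values())
    }


placement = allocate(3, ["node-a", "node-b", "node-c"])
print(placement)
print(sorted(surviving_shards(placement, "node-a")))  # [0, 1, 2]
```

Every shard keeps one copy after any single node fails, so searches can continue while the cluster promotes replicas and rebuilds the missing copies.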

Why Use Elasticsearch?

Because it is built with Java, Elasticsearch runs on any platform that has a JVM. Compared to most NoSQL databases, Elasticsearch is much more focused on search functionality and comes with a rich and powerful RESTful HTTP API that lets you perform fast searches in near real time.

To fully understand how Elasticsearch can help you and why you may want to use it, here are some of its main capabilities:

Full Text Search Engine

Traditional SQL database management systems are not designed for full-text searches against large volumes of data. Because it’s built on top of Lucene, Elasticsearch offers powerful full-text search capabilities and lets you perform and combine many types of searches: structured, unstructured, geo, and metric.

Analytical Engine

The analytical use case is the most popular Elasticsearch use case, even more popular than full-text search. Specifically, Elasticsearch is often used for log analytics and for slicing and dicing numerical data such as application and infrastructure performance metrics. Although Apache Solr provided faceting before Elasticsearch was even born, Elasticsearch took faceting to another level, enabling its users to aggregate data on the fly using aggregation queries. These aggregation queries are what power pretty much all the data visualizations you see in tools like Kibana, Grafana, and others.
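As an illustration of what an aggregation query expresses, the sketch below shows a request body for a `terms` aggregation with an `avg` sub-aggregation, followed by a plain-Python function that computes the same kind of result over a small in-memory list. The field names `host` and `latency_ms`, the aggregation names, and the sample records are assumptions; the local function only mimics the outcome and says nothing about how Elasticsearch computes it across shards.

```python
from collections import defaultdict

# Request body: bucket documents by host, then average latency per bucket.
# "size": 0 asks for aggregation results only, with no matching documents.
agg_request = {
    "size": 0,
    "aggs": {
        "by_host": {
            "terms": {"field": "host"},
            "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    },
}


def average_by(records, group_field, value_field):
    """Group records by one field and average another, like terms + avg."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for record in records:
        bucket = totals[record[group_field]]
        bucket[0] += record[value_field]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


logs = [
    {"host": "web-1", "latency_ms": 120},
    {"host": "web-1", "latency_ms": 80},
    {"host": "web-2", "latency_ms": 200},
]
print(average_by(logs, "host", "latency_ms"))  # {'web-1': 100.0, 'web-2': 200.0}
```

A dashboard panel showing average latency per host is essentially a chart drawn from the buckets such a query returns.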

Distributed Architecture Designed for Scaling

Elasticsearch was built to scale from the beginning. Its distributed architecture allows you to scale Elasticsearch to a lot of servers and accommodate petabytes of data.

Distributed systems are complex, but Elasticsearch makes many decisions automatically and provides a good management API. Scaling Elasticsearch is, therefore, much easier than with many other systems, though large Elasticsearch clusters come with their set of issues and often require Elasticsearch expertise. Elasticsearch can also replicate data automatically to prevent data loss in case of node failures.

Effective Investment Right Out of The Box

Elasticsearch’s mechanics are quite easy to grasp, at least when one is dealing with a relatively small dataset or small deployment. Its simple RESTful APIs work with ingestion tools such as Logstash to send data to Elasticsearch as JSON documents, or Kibana to build reports and visualize your data. 

These capabilities, along with a short learning curve, allow you to start working on use cases quickly and become more productive.

Rich Ecosystem

One of the main reasons for Elasticsearch’s rise in popularity is its well-documented API. The availability of this API made it possible for developers to integrate with it, and over time that is exactly what they did. Virtually every log shipper or logging library has an adapter for sending data to Elasticsearch. Logstash may be the most popular one, but there are many others.

Besides various tools that can ingest data into Elasticsearch via its API, there are also tools like Kibana or Grafana aimed at Elasticsearch data exploration, analysis, and visualization.

Compatible with Many Languages

Elasticsearch has client libraries for many programming languages such as Java, JavaScript, PHP, C#, Ruby, Python, Go, and many more. Availability of these client libraries makes it quite easy for developers to integrate with Elasticsearch.

Elasticsearch Use Cases & Applications Examples

As a distributed engine, Elasticsearch is highly scalable and offers near real-time search capabilities. This adds up to a solution that can do more than a search engine and supports a multitude of growing critical business needs and operational use cases.

Generally, thanks to its powerful search capabilities, Elasticsearch is used as the underlying technology that powers applications with complex search features and requirements. It supports many kinds of data: numbers, text, geo, structured, and unstructured.

Elasticsearch is popular because it is versatile in how it handles data and pairs well with other tools. Companies like Wikipedia, GitHub, The New York Times, and Facebook all use Elasticsearch for various use cases: from easy search across all 164 years of published articles to instantaneous live chat or a seamless e-commerce experience, any business that needs to serve information quickly can put Elasticsearch to good use.

Its capabilities continue to grow and change with business goals. Here’s how businesses have used Elasticsearch for different use cases:

Instantaneous E-commerce Search Across Retail Product Catalogues

Retailers are using Elasticsearch to index their product catalogs and inventory, alongside all the product attributes, so when the clients search for a specific product attribute, their store can display the right products instantly.

A near instant search bar can boost revenue by delivering a better product catalog search experience and make search the primary form of navigation.

Walgreens and Kroger are some of the biggest retail companies streamlining their online grocery shopping experience with Elasticsearch.

Operational Logging Analytics

Companies like GoDaddy use Elasticsearch to process billions of events every day, analyzing logs to ensure consistent system performance and detect anomalies, which has helped them improve the customer experience.

Site Content and Media Search

Engadget and The New York Times are using Elasticsearch for site content search to better understand what their users are searching for and why – all with the goal to improve their user engagement KPIs.

Using Elasticsearch for site content search is not limited to publishers – Shopify and Asana also use it to make their documentation and support content easily findable to clients. Search is also not limited to articles. One of the biggest video hosting companies, Vimeo, powers the search of millions of videos every day through Elasticsearch.

Instantaneous Live Chat

LiveChat is one company that improved the customer experience for 6,000 customers conducting millions of queries daily – all by using Elasticsearch to maintain an archive of 460 million documents and deliver instantaneous query response times.

Fraud Monitoring and Early Detection

SoftBank and Xoom are preventing and protecting against fraud and security threats by monitoring their system with Elasticsearch.

Application Search

One of the biggest companies using Elasticsearch for application search is eBay, which searches across 800 million listings in under a second and maintains a world-class end-user experience for millions of people every day.

Business Analytics

Walmart is using Elasticsearch to gain insights into customer purchasing patterns and store performance metrics, in order to enhance the in-store and online retail customer shopping experience and boost their commercial success.

Enterprise Search

Facebook uses Elasticsearch and has gone from a simple enterprise search to over 40 tools across multiple clusters with 60+ million queries a day and growing.

Metrics Analytics

Sprint is using Elasticsearch to analyze over 200 dashboards, representing 3 billion events per day from logs, databases, emails, syslogs, test messages, and internal and vendor application APIs, in order to search for better retail operations insights.

Security Analytics

Slack is building a defensive security program to monitor malicious activity by using Elasticsearch. Cisco is also using Elasticsearch to leverage data to detect and defeat hackers and fight cyber threats.

Scraping and Analyzing Public Data

Public data like social media conversations can be mined by using Elasticsearch to do real-time analysis, resulting in a social sentiment analysis to understand customers.

I hope you have enjoyed this post and that it helped you understand Elasticsearch and how it works. Please like and share, and feel free to comment if you have any suggestions or feedback.
