Elasticsearch Series (2) – Basic Concepts of ES
This chapter covers some basic concepts of Elasticsearch. Without further ado, let's get straight to the point.
1. What is Lucene
Lucene is an open source search engine library from Apache that provides the core APIs for building search engines.
1. Advantages of Lucene: easy to extend, high performance (based on inverted indexes)
2. Disadvantages of Lucene: limited to Java development, steep learning curve, and no built-in support for horizontal scaling
2. What is Elasticsearch
Elasticsearch (ES for short) is an open source, distributed full-text search and analytics engine. It helps us quickly find what we need in massive amounts of data.
1. Elasticsearch is built on Lucene. Compared with Lucene, Elasticsearch has the following advantages: it is distributed and can scale horizontally; it exposes a RESTful interface that can be called from any language.
2. Elasticsearch combined with Kibana, Logstash, and Beats forms the Elastic Stack (ELK). It is widely used in log analysis, real-time monitoring, and other fields.
3. Elasticsearch is the core of the Elastic Stack (ELK) and is responsible for storing, searching, and analyzing data.
4. Official website address: https://www.elastic.co/cn/
3. What is the Elastic Stack (ELK)
It is a technology stack with Elasticsearch at its core, consisting of Beats, Logstash, Kibana, and Elasticsearch.
4. Forward index and inverted index
1. What is a forward index
Forward index: an index built on the document id. To query by term, you must first fetch each document and then check whether it contains the term.
Traditional databases (such as MySQL) use forward indexes; for example, creating an index on the id column of the following table (tb_goods):
This is the traditional forward index. Retrieval by the indexed id is efficient, but retrieval by part of the content is inefficient, because every row has to be read and checked.
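The difference between the two lookups can be sketched in a few lines of Python. This is only an illustration: the rows below are made-up stand-ins for the tb_goods table, and the function names are hypothetical.

```python
# Hypothetical rows standing in for the tb_goods table (values are made up).
tb_goods = {
    1: "Xiaomi phone",
    2: "Huawei phone",
    3: "Huawei smart watch",
}

def get_by_id(doc_id):
    """Lookup by id: one keyed access, which is what a forward index is good at."""
    return tb_goods.get(doc_id)

def search_by_term(term):
    """Lookup by content: every row has to be fetched and inspected."""
    term = term.lower()
    return [doc_id for doc_id, title in tb_goods.items()
            if term in title.lower().split()]

by_id = get_by_id(2)
by_term = search_by_term("phone")
print(by_id, by_term)
```

The id lookup touches one entry, while the term lookup walks the whole table, and that cost grows with the number of rows.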
2. What are documents and terms
Document: Each piece of data is a document
Term: The words obtained by segmenting the document content according to its semantics
3. What is an inverted index
Inverted index: segment the document content into terms, build an index on the terms, and record which documents each term appears in. When querying, first look up the document ids by term, then fetch the documents by id.
Elasticsearch uses an inverted index:
Document: Each piece of data is a document
Term: The document is divided into words based on semantics
For example: Create an inverted index on title
When an inverted index is built, the document content is split into terms according to semantics; each term is then stored together with the ids of the documents that contain it.
The search process is as follows:
Searching an inverted index takes two lookups. The first looks up the terms entered by the user in the term list to find the corresponding document ids. The second uses those document ids to fetch the documents. Although there are two lookups, each one is an index lookup, so the overall query is efficient.
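A minimal sketch of building an inverted index on title and running the two-step search might look like the following. The documents are made up, and a simple whitespace split stands in for a real analyzer, so this only illustrates the idea and is not how Elasticsearch or Lucene actually stores its index.

```python
from collections import defaultdict

# Made-up documents: id -> title.
docs = {
    1: "Xiaomi phone",
    2: "Huawei phone",
    3: "Huawei smart watch",
}

def tokenize(text):
    # Assumption: a lowercase whitespace split stands in for a real analyzer.
    return text.lower().split()

def build_inverted_index(documents):
    """Map each term to the list of ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, title in documents.items():
        for term in tokenize(title):
            if doc_id not in index[term]:
                index[term].append(doc_id)
    return index

inverted = build_inverted_index(docs)

def search(query):
    # Lookup 1: term -> document ids.
    doc_ids = []
    for term in tokenize(query):
        for doc_id in inverted.get(term, []):
            if doc_id not in doc_ids:
                doc_ids.append(doc_id)
    # Lookup 2: document id -> document.
    return [(doc_id, docs[doc_id]) for doc_id in sorted(doc_ids)]

results = search("Huawei phone")
print(dict(inverted))
print(results)
```

Neither lookup scans document content at query time; both are keyed accesses, which is why the two-step search stays fast.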
5. Basic concepts of ES
1. Field
Field: similar to a column in MySQL.
2. Document
Document: a piece of data, expressed in JSON format. Elasticsearch is document-oriented; a document can be a product record or an order record from a database. Document data is serialized to JSON and stored in Elasticsearch.
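As a small illustration, here is how a product record could be serialized to JSON before being sent to Elasticsearch. The field names and values are hypothetical.

```python
import json

# Hypothetical product record; field names and values are made up.
product = {"id": 1, "title": "Xiaomi phone", "price": 3499, "brand": "Xiaomi"}

# Serialize to the JSON text that would be sent as the document body.
doc_json = json.dumps(product, ensure_ascii=False)

# Reading it back gives the same structure.
restored = json.loads(doc_json)
print(doc_json)
```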
3. Type
Type: a concept that is being phased out. A type is like a table in the relational database MySQL, such as a user table or a product table. Note: in Elasticsearch 7.x, an index can contain only one type; you cannot name it yourself, and the default type name is _doc.
4. Index (Index)
Index: a collection of documents of the same kind, similar to a table in a MySQL database. (Since Elasticsearch 7.x, custom types are deprecated and the only type is _doc; before that, an index was closer to the concept of a database in MySQL.)
5. Mapping
Mapping: field constraint information of documents in the index, similar to table structure constraints
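A mapping is usually supplied as a JSON body when the index is created. The sketch below builds such a body in Python, assuming a hypothetical goods index with three made-up fields and the typeless 7.x body layout.

```python
import json

# Hypothetical body for creating an index (for example PUT /goods in 7.x).
create_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},      # analyzed into terms for full-text search
            "brand": {"type": "keyword"},   # kept as one exact value
            "price": {"type": "integer"},
        }
    }
}

def field_types(body):
    """Return {field name: declared type} from a create-index body."""
    props = body["mappings"]["properties"]
    return {name: spec["type"] for name, spec in props.items()}

types_by_field = field_types(create_index_body)
print(json.dumps(create_index_body, indent=2))
```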
6. Query DSL
Query DSL: ES provides a powerful way to retrieve data, which is called Query DSL (Domain Specific Language). DSL is a JSON-style request statement provided by Elasticsearch, which is used to operate Elasticsearch and implement CRUD.
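For example, a full-text match query is just a JSON object. The helper below is a hypothetical convenience function, and the goods index name is an assumption; only the shape of the query body follows the Query DSL.

```python
import json

def match_query(field, text, size=10):
    """Build the JSON body of a full-text `match` search."""
    return {"query": {"match": {field: text}}, "size": size}

body = match_query("title", "Huawei phone")
request_line = "GET /goods/_search"  # hypothetical index name
payload = json.dumps(body)
print(request_line)
print(payload)
```

The payload would be sent as the request body to the search endpoint of the index.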
7. Sharding (shard)
Shard: the data in an index can be split into multiple shards stored on multiple servers, which increases the amount of data an index can hold, speeds up retrieval, and improves system performance.
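How a document ends up on a particular shard can be sketched with the simplified routing rule shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The sketch below uses CRC32 purely as a stand-in hash (Elasticsearch uses its own Murmur3-based hash), so the shard numbers it produces will not match a real cluster.

```python
import zlib

def pick_shard(routing, num_primary_shards):
    """Simplified routing: hash the routing value, then take it modulo the shard count."""
    # Assumption: CRC32 is only a stand-in for the hash Elasticsearch really uses.
    h = zlib.crc32(routing.encode("utf-8"))
    return h % num_primary_shards

# Distribute a few hypothetical document ids over 3 primary shards.
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in ["1", "2", "3", "4", "5"]}
print(placement)
```

Because the shard number depends on the number of primary shards, that number is normally fixed when the index is created.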
8. Replica
Replica: a copy of a shard that holds the same data and serves as a backup. When a shard fails, data can be read from its replica so that the system is not affected.
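Shard and replica counts are set in the index settings. As a quick worked example with made-up numbers: 3 primary shards with 1 replica each gives 6 shard copies in total.

```python
# Hypothetical index settings: 3 primary shards, 1 replica per primary.
settings_body = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}

def total_shard_copies(body):
    """Primaries plus replicas: each primary has `number_of_replicas` copies."""
    s = body["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])

total = total_shard_copies(settings_body)
print(total)
```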
9. Node
Node: a single Elasticsearch instance; one machine can run multiple nodes. If no node name is configured, a default one is assigned (a random name in older versions, the hostname in 7.x).
10. Cluster
Cluster: A group of Elasticsearch instances. The default cluster name is elasticsearch.
11. Concept comparison
MySQL | Elasticsearch | Description |
Table | Index | An index is a collection of documents, similar to a database table. (Since Elasticsearch 7.x, custom types are deprecated and the only type is _doc; before that, an index was closer to the concept of a database in MySQL.) |
Row | Document | Document is a piece of data, similar to a row in a database. Documents are all in JSON format. |
Column | Field | Field is a field in a JSON document, similar to a column in a database. |
Schema | Mapping | Mapping is a constraint on documents in the index, such as field type constraints, similar to the table structure (Schema) of a database. |
SQL | DSL | ES provides a powerful way to retrieve data, which is called Query DSL (Domain Specific Language). DSL is a JSON-style request statement provided by Elasticsearch, which is used to operate Elasticsearch and implement CRUD. |
12. The relationship between Elasticsearch and database
MySQL: Good at transactional operations and can ensure data security and consistency.
Elasticsearch: Good at searching, analyzing, and calculating massive data.
13. Metadata (Document MetaData)
Metadata: fields that carry information about the document itself.
1) _index: the name of the index the document belongs to
2) _type: the type name of the document, _doc by default
3) _id: the unique ID of the document
4) _source: the original JSON of the document; the content of every field can be read from here
5) _all: combines the contents of all fields into one field; disabled by default in 6.x and removed in 7.x
6) _score: relevance score
7) _version: version number of the document
8) _seq_no: sequence number
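To make the distinction concrete, the sketch below separates metadata from the original document. The sample is hand-written for illustration, not real server output, and uses only fields from the list above.

```python
# Hand-written sample, not real server output; values are made up.
sample = {
    "_index": "goods",
    "_type": "_doc",
    "_id": "1",
    "_version": 1,
    "_seq_no": 0,
    "_source": {"title": "Xiaomi phone", "price": 3499},
}

def split_metadata(doc):
    """Separate the underscore-prefixed metadata from the original document."""
    metadata = {k: v for k, v in doc.items()
                if k.startswith("_") and k != "_source"}
    return metadata, doc.get("_source", {})

metadata, source = split_metadata(sample)
print(metadata)
print(source)
```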
6. Data types in ES
1. Core data types
1) String types: text (text that is analyzed into terms), keyword (exact value, such as brand, country, IP address)
2) Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
3) Date type: date
4) Boolean type: boolean
5) Binary type: binary
6) Range types: integer_range, float_range, long_range, double_range, date_range
2. Complex data types
1) Array type: array (there is no dedicated array mapping type; any field can hold multiple values of the same type)
2) Object type: object
3) Nested type: nested
3. Geographical location data type
1) geo_point
2) geo_shape
4. Special type
1) Record the ip address: ip
2) Implement automatic completion: completion
3) Record the number of word segments: token_count
4) Record the string hash value: murmur3 (provided by the mapper-murmur3 plugin)
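Several of these types can be combined in one mapping. The sketch below assumes a hypothetical shop index whose field names are all made up; it only shows how the type names are written, including an object field (declared through properties without an explicit type) and a nested field.

```python
# Hypothetical mapping for a made-up "shop" index.
shop_mapping = {
    "properties": {
        "name": {"type": "text"},
        "brand": {"type": "keyword"},
        "rating": {"type": "half_float"},
        "opened_on": {"type": "date"},
        "is_open": {"type": "boolean"},
        "price_range": {"type": "integer_range"},
        "location": {"type": "geo_point"},
        "server_ip": {"type": "ip"},
        "suggest": {"type": "completion"},
        # Object field: sub-fields are listed under properties, no explicit type.
        "owner": {"properties": {"name": {"type": "keyword"}}},
        # Nested field: each review is indexed as its own hidden document.
        "reviews": {"type": "nested", "properties": {"stars": {"type": "byte"}}},
    }
}

def declared_type(spec):
    """A field without an explicit type but with properties is an object."""
    return spec.get("type", "object")

types = {name: declared_type(spec)
         for name, spec in shop_mapping["properties"].items()}
print(types)
```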
This article was carefully written by the blogger. Please keep the original link when reprinting: https://www.cnblogs.com/xyh9039/p/17842159.html
Copyright Statement: Any similarity is purely coincidental. If there is any infringement, please contact me promptly so I can make corrections. Thanks!