Elasticsearch: Index Creation / Data Retrieval

qingmoucsdn

2019-07-01

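Starting with ES 6.0, having multiple types under one index is no longer supported: an index created in 6.0 can hold only a single type. Types caused field-type conflicts and hurt search efficiency, which is why they are being removed; specifying types is deprecated in 7.x and types are removed entirely in 8.0. (5.x to 6.x)
The _all field is deprecated as well; use copy_to to build a custom combined field instead. (5.x to 6.x)
type: text/keyword decides whether a field is analyzed (tokenized); index: true/false decides whether it is indexed. (2.x to 5.x)
analyzer sets the analyzer for an individual field. (2.x to 5.x)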

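Creating the index

First install the ik analysis plugin and restart the service. The plugin version has to match the Elasticsearch version (6.6.2 here).

# Install with elasticsearch-plugin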
elasticsearch-plugin install \
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

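Reference for document field types:
https://www.elastic.co/guide/...

Reference for other field parameters (different field types may have their own specific attributes):
https://www.elastic.co/guide/...

We create a new index named news:

Set the default analyzer to the ik analyzer to handle Chinese text
Define the type with the default name _doc
Deliberately disable _source storage (to verify the store option)
title is not stored, author is not analyzed, content is stored

For what the _source field means, see this blog post: https://blog.csdn.net/napoay/...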

PUT /news
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "index": {
            "analysis.analyzer.default.type" : "ik_smart"
        }
    },
    "mappings": {
        "_doc": {
            "_source": {
                "enabled": false
            },
            "properties": {
                "news_id": {
                    "type": "integer",
                    "index": true
                },
                "title": {
                    "type": "text",
                    "store": false
                },
                "author": {
                    "type": "keyword"
                },
                "content": {
                    "type": "text",
                    "store": true
                },
                "created_at": {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss"
                }
            }
        }
    }
}
# Inspect the resulting mapping
GET /news/_mapping

Verifying the analyzer

# Check that the analysis plugin works
GET /_analyze
{
    "analyzer": "ik_smart",
    "text": "我热爱祖国"
}
GET /_analyze
{
    "analyzer": "ik_max_word",
    "text": "我热爱祖国"
}

# The index's default analyzer
GET /news/_analyze
{
    "text": "我热爱祖国！"
}

# With a field specified, the text is analyzed according to that field's mapping
# author is a keyword field, so it is not tokenized
GET /news/_analyze
{
    "field": "author",
    "text": "我热爱祖国！"
}
# Analysis result for title
GET /news/_analyze
{
    "field": "title",
    "text": "我热爱祖国！"
}

Adding documents

These are for demonstration; the queries below use these documents as examples.

POST /news/_doc
{
    "news_id": 1,
    "title": "我们一起学旺叫",
    "author": "才华横溢王大猫",
    "content": "我们一起学旺叫，一起旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，我的尾巴可劲儿摇",
    "created_at": "2019-03-26 11:55:20"
}
POST /news/_doc
{
    "news_id": 2,
    "title": "我们一起学猫叫",
    "author": "王大猫不会被分词",
    "content": "我们一起学猫叫，还是旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，我的尾巴可劲儿摇",
    "created_at": "2019-03-26 11:55:20"
}
POST /news/_doc
{
    "news_id": 3,
    "title": "实在编不出来了",
    "author": "王大猫",
    "content": "实在编不出来了，随便写点数据做测试吧，旺旺旺",
    "created_at": "2019-03-26 11:55:20"
}

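Searching data

GET /news/_doc/_search is the endpoint for searching _doc documents under news. We demonstrate with the REST API plus the query DSL.

match_all

Fetches all data with no search condition.

# Paginated search with no condition, sorted by news_id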
GET /news/_doc/_search
{
    "query": {
        "match_all": {}
    },
    "from": 0,
    "size": 2,
    "sort": {
        "news_id": "desc"
    }
}

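Because we disabled the _source field, ES only builds the inverted index for the data and does not store the original document, so the results contain no original document content. The main reason for disabling it is to demonstrate the highlight mechanism.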

match

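An ordinary search. Many articles say that a match query analyzes the query text, but that is not entirely accurate: it also depends on the type of the field being searched. If the field is a keyword (not_analyzed in older versions), which is never tokenized, then match behaves the same as a term query.

We can use the analyze API to see how a field will be processed: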

GET /news/_analyze
{
    "field": "title",
    "text": "我会被如何处理呢？分词？不分词？"
}

Query

GET /news/_doc/_search
{
    "query": {
        "match": {
            "title": "我们会被分词"
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

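With highlight, the matched keywords are returned in their surrounding context, marked up for highlighting. If _source is disabled, the field's store attribute must be enabled so that the field's original value is kept; otherwise there is no original content left and nothing to highlight.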

multi_match

Searches several fields at once. For example, to find documents whose title or content contains the keyword 我们:

GET /news/_doc/_search
{
    "query": {
        "multi_match": {
            "query": "我们是好人",
            "fields": ["title", "content"]
        }
    },
    "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

match_phrase

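This one deserves careful attention. match_phrase is a phrase query. Put simply, the document field must contain every term produced by analyzing the query text, and the positional gaps between those terms in the document must stay within the threshold set by slop. slop is the number of position moves allowed to line the query terms up with their positions in the document; if slop is large enough, the terms can be scattered widely across the document and still be matched by moving them.

content: i love china
match_phrase: i china
slop: 0  // no match: china has to be moved 1 position, giving i - china, to line up
slop: 1  // match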

Test example

# First see how the query text is analyzed
GET /news/_analyze
{
    "field": "title",
    "text": "我们学"
}
# response
{
    "tokens": [
        {
            "token": "我们",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "学",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        }
    ]
}

# Then see how a document's title is tokenized for the inverted index
GET /news/_analyze
{
    "field": "title",
    "text": "我们一起学旺叫"
}
# response
{
    "tokens": [
        {
            "token": "我们",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "一起",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "学",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 2
        },
        ...
    ]
}

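Note the position field. A document matches only when slop is at least the number of positions the query terms have to be moved to line up with where they sit in the document.

The query text is analyzed into ["我们", "学"], at adjacent positions 0 and 1. In the document, 我们 and 学 sit at positions 0 and 2, one position further apart, so slop must be greater than or equal to 1 for this document to be found.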
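
The position arithmetic above can be sketched in a few lines of Python. This is a simplified model, not Lucene's actual sloppy-phrase scorer: it assumes the query terms are distinct, occur at most once in the document, and appear in the same order as in the query. The function name required_slop is hypothetical.

```python
def required_slop(query_tokens, doc_tokens):
    """Smallest slop for an in-order phrase match, or None if there is none.

    Simplified model (an assumption of this sketch): query tokens are
    distinct and each occurs at most once in the document.
    """
    positions = []
    for token in query_tokens:
        if token not in doc_tokens:
            return None  # a query term is missing from the document
        positions.append(doc_tokens.index(token))
    if positions != sorted(positions):
        return None  # out-of-order matches are not modelled here
    # Span between first and last matched term, minus the span in the query.
    return (positions[-1] - positions[0]) - (len(query_tokens) - 1)


# Positions taken from the two examples in the text.
print(required_slop(["i", "china"], ["i", "love", "china"]))
print(required_slop(["我们", "学"], ["我们", "一起", "学"]))
```

A document is returned when the slop given in the query is at least the value this function computes.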

Using the phrase query:

GET /news/_doc/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "我们学",
                "slop": 1
            }
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Query result:

{
            ...
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "if-CuGkBddO9SrfVBoil",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>我们</em>一起<em>学</em>猫叫"
                    ]
                }
            },
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "iP-AuGkBddO9SrfVOIg3",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>我们</em>一起<em>学</em>旺叫"
                    ]
                }
            }
            ...
}

term

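The key point about term is only that the query value is not analyzed; it is looked up in the index as a single term. Whether a field was tokenized when the document was indexed is determined by the mapping. The index may contain the two terms ["我们", "一起"] but no term ["我们一起"], in which case the query below finds nothing. A keyword field is not tokenized at index time and is indexed as one whole term, and the query value is not analyzed either, so it matches exactly.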

GET /news/_doc/_search
{
    "query": {
        "term": {
           "title": "我们一起" 
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}
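
To make the difference between term and match concrete, here is a small Python sketch of an inverted index. It is only a toy: toy_analyze and its two-word VOCAB are hypothetical stand-ins for the ik analyzer, and the real tokens ik produces may differ.

```python
VOCAB = {"我们", "一起"}  # hypothetical dictionary standing in for ik


def toy_analyze(text, vocab=VOCAB, max_len=2):
    """Greedy longest-match tokenizer; unknown characters become 1-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += len(piece)
                break
    return tokens


def build_index(docs, analyzed):
    """Map each term to the ids of the documents containing it."""
    index = {}
    for doc_id, value in docs.items():
        terms = toy_analyze(value) if analyzed else [value]  # keyword: whole value
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def term_query(index, value):
    """term: look the value up as-is, without analyzing it."""
    return set(index.get(value, set()))


def match_query(index, text, analyzed):
    """match: analyze the query text like the field, then OR the terms."""
    terms = toy_analyze(text) if analyzed else [text]
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


titles = {1: "我们一起学旺叫", 2: "我们一起学猫叫", 3: "实在编不出来了"}
authors = {1: "才华横溢王大猫", 2: "王大猫不会被分词", 3: "王大猫"}
title_index = build_index(titles, analyzed=True)     # text field
author_index = build_index(authors, analyzed=False)  # keyword field

print(term_query(title_index, "我们一起"))         # no such term in the index
print(match_query(title_index, "我们一起", True))  # analyzed into 我们 / 一起
print(term_query(author_index, "王大猫"))          # exact match only
```

On the keyword index, match_query with analyzed=False does the same lookup as term_query, which is the equivalence described in the match section.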

terms

terms takes several terms at once, much like tokenizing by hand.

GET /news/_doc/_search
{
    "query": {
        "terms": {
           "title": ["我们", "一起"]
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Any document containing either of the terms ["我们", "一起"] is returned.

wildcard

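Shell-style wildcard query: ? matches exactly one character and * matches zero or more characters. The pattern is matched against the terms in the inverted index.

Find documents that have a term of exactly two characters:

GET /news/_doc/_search
{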
   "query": {
       "wildcard": {
               "title": "??"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

prefix

Prefix query: matches terms in the inverted index that start with the given prefix.

GET /news/_doc/_search
{
   "query": {
       "prefix": {
               "title": "我"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

regexp

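Regular expression query: the pattern is matched against the terms in the inverted index.

Find documents that have a term of 2 to 3 characters:

GET /news/_doc/_search
{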
   "query": {
       "regexp": {
               "title": ".{2,3}"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
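
The wildcard, prefix and regexp queries above all test the pattern against individual terms in the inverted index, not against the original field text. A rough Python sketch of that idea follows; the TERMS list is a hypothetical set of title terms, and the regexp is matched against the whole term because Lucene regular expressions are anchored.

```python
import fnmatch
import re

# Hypothetical terms of one analyzed title; real ik output may differ.
TERMS = ["我们", "一起", "学", "旺", "叫"]


def wildcard_terms(pattern, terms=TERMS):
    """? matches one character, * matches zero or more."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]


def prefix_terms(prefix, terms=TERMS):
    """Terms that start with the given prefix."""
    return [t for t in terms if t.startswith(prefix)]


def regexp_terms(pattern, terms=TERMS):
    """The pattern must match the entire term."""
    return [t for t in terms if re.fullmatch(pattern, t)]


print(wildcard_terms("??"))
print(prefix_terms("我"))
print(regexp_terms(".{2,3}"))
```

A document is a hit as soon as one of its terms passes the test, which is why "??" finds documents containing any two-character term.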

bool

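A bool query combines several queries:
must: every clause must match
must_not: no clause may match
should: at least one clause must match when there is no must or filter; otherwise should clauses are optional and only raise the score
filter: every clause must match, but without affecting the score

GET /news/_doc/_search
{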
   "query": {
        "bool": {
            "must": {
                "match": {
                    "title": "绝对要有我们"
                }
            },
            "must_not": {
                "term": {
                    "title": "绝对不能有我"
                }
            },
            "should": [
                {
                    "match": {
                        "content": "我们"
                    }
                },
                {
                    "multi_match": {
                        "query": "满足",
                        "fields": ["title", "content"]
                    }
                },
                {
                    "match_phrase": {
                        "title": "一个即可"
                    }
                }
            ],
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2019-01-05 12:00:00"
                    }
                }
            }
        }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
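
The way the bool clauses combine can be sketched in Python as below. This models only whether a document matches, not its score, and treats each clause as a plain predicate; bool_matches and has_term are hypothetical helper names, and the sample documents hold hand-written term sets rather than real analyzer output.

```python
def bool_matches(doc, must=(), must_not=(), should=(), filters=()):
    """Return True if doc satisfies the clause lists (each clause is a predicate)."""
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filters):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    if should and not must and not filters:
        # With no must/filter present, at least one should clause is required.
        return any(clause(doc) for clause in should)
    return True  # should clauses are optional here and would only affect scoring


def has_term(field, term):
    """Build a predicate that checks whether a field contains a term."""
    return lambda doc: term in doc[field]


doc = {"title": {"我们", "一起"}}  # hypothetical term set of one title

print(bool_matches(doc, must=[has_term("title", "我们")],
                   should=[has_term("title", "猫")]))
print(bool_matches(doc, should=[has_term("title", "猫")]))
print(bool_matches(doc, must=[has_term("title", "我们")],
                   must_not=[has_term("title", "一起")]))
```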

filter

filter is usually combined with queries such as match, to filter the documents that satisfy the query.

GET /news/_doc/_search
{
   "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
        }
   }
}

Or use it on its own:

GET /news/_doc/_search
{
   "query": {
       "constant_score" : {
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
       }
   }
}

Multiple filter conditions: 2017-12-05 12:00:00 < created_at < 2020-12-05 12:00:00 and news_id >= 2

GET /news/_doc/_search
{
   "query": {
       "constant_score" : {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "created_at": {
                                    "lt": "2020-12-05 12:00:00",
                                    "gt": "2017-12-05 12:00:00"
                                }
                            }
                        },
                        {
                            "range": {
                                "news_id": {
                                    "gte": 2
                                }
                            }
                        }
                    ]
                }
            }
       }
   }
}
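
As a cross-check of the two range conditions above, the same filter can be written in plain Python. This is only an illustration of the gt/gte/lt/lte semantics; in_range and filter_docs are hypothetical names, and the documents carry just the two fields the filter looks at.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"  # Python counterpart of yyyy-MM-dd HH:mm:ss


def parse(text):
    return datetime.strptime(text, FORMAT)


def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """gt/lt are exclusive bounds, gte/lte are inclusive ones."""
    checks = [
        gt is None or value > gt,
        gte is None or value >= gte,
        lt is None or value < lt,
        lte is None or value <= lte,
    ]
    return all(checks)


def filter_docs(docs):
    """Ids of documents passing both range conditions (a bool/must of two ranges)."""
    low, high = parse("2017-12-05 12:00:00"), parse("2020-12-05 12:00:00")
    return [
        doc["news_id"]
        for doc in docs
        if in_range(parse(doc["created_at"]), gt=low, lt=high)
        and in_range(doc["news_id"], gte=2)
    ]


DOCS = [{"news_id": n, "created_at": "2019-03-26 11:55:20"} for n in (1, 2, 3)]
print(filter_docs(DOCS))
```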
