Elasticsearch如何实现SQL语句中 Group By 和 Limit 的功能
有 SQL 背景的同学在学习 Elasticsearch 时,面对一个查询需求,不由自主地会先思考如何用 SQL 来实现,然后再去想 Elasticsearch 的 Query DSL 如何实现。那么本篇就给大家讲一条常见的 SQL 语句如何用 Elasticsearch 的查询语言实现。
1. SQL语句
假设我们有一个汽车的数据集,每个汽车都有车型、颜色等字段,我希望获取颜色种类大于1个的前2车型。假设汽车的数据模型如下:
{ "model":"modelA", "color":"red" }
假设我们有一个 cars 表,通过如下语句创建测试数据。
INSERT INTO cars (model,color) VALUES ('A','red'); INSERT INTO cars (model,color) VALUES ('A','white'); INSERT INTO cars (model,color) VALUES ('A','black'); INSERT INTO cars (model,color) VALUES ('A','yellow'); INSERT INTO cars (model,color) VALUES ('B','red'); INSERT INTO cars (model,color) VALUES ('B','white'); INSERT INTO cars (model,color) VALUES ('C','black'); INSERT INTO cars (model,color) VALUES ('C','red'); INSERT INTO cars (model,color) VALUES ('C','white'); INSERT INTO cars (model,color) VALUES ('C','yellow'); INSERT INTO cars (model,color) VALUES ('C','blue'); INSERT INTO cars (model,color) VALUES ('D','red'); INSERT INTO cars (model,color) VALUES ('A','red');
那么实现我们需求的 SQL 语句也比较简单,实现如下:
SELECT model,COUNT(DISTINCT color) color_count FROM cars GROUP BY model HAVING color_count > 1 ORDER BY color_count desc LIMIT 2;
这条查询语句中 Group By 是按照 model 做分组, Having color_count>1 限定了车型颜色种类大于1,ORDER BY color_count desc 限定结果按照颜色种类倒序排列,而 LIMIT 2 限定只返回前3条数据。
那么在 Elasticsearch 中如何实现这个需求呢?
2. 在 Elasticsearch 模拟测试数据
首先我们需要先在 elasticsearch 中插入测试的数据,这里我们使用 bulk 接口 ,如下所示:
POST _bulk {"index":{"_index":"cars","_type":"doc","_id":"1"}} {"model":"A","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"2"}} {"model":"A","color":"white"} {"index":{"_index":"cars","_type":"doc","_id":"3"}} {"model":"A","color":"black"} {"index":{"_index":"cars","_type":"doc","_id":"4"}} {"model":"A","color":"yellow"} {"index":{"_index":"cars","_type":"doc","_id":"5"}} {"model":"B","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"6"}} {"model":"B","color":"white"} {"index":{"_index":"cars","_type":"doc","_id":"7"}} {"model":"C","color":"black"} {"index":{"_index":"cars","_type":"doc","_id":"8"}} {"model":"C","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"9"}} {"model":"C","color":"white"} {"index":{"_index":"cars","_type":"doc","_id":"10"}} {"model":"C","color":"yellow"} {"index":{"_index":"cars","_type":"doc","_id":"11"}} {"model":"C","color":"blue"} {"index":{"_index":"cars","_type":"doc","_id":"12"}} {"model":"D","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"13"}} {"model":"A","color":"red"}
其中 index 为 cars,type 为 doc,所有数据与mysql 数据保持一致。大家可以在 Kibana 的 Dev Tools 中执行上面的命令,然后执行下面的查询语句验证数据是否已经成功存入。
GET cars/_search
3. Group By VS Terms/Metric Aggregation
SQL 中 Group By 语句在 Elasticsearch 中对应的是 Terms Aggregation,即分桶聚合,对应 Group By color 的语句如下所示:
GET cars/_search { "size":0, "aggs":{ "models":{ "terms":{ "field":"model.keyword" } } } }
结果如下:
{ "took": 161, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "A", "doc_count": 5 }, { "key": "C", "doc_count": 5 }, { "key": "B", "doc_count": 2 }, { "key": "D", "doc_count": 1 } ] } } }
我们看 aggregations 这个 key 下面的即为返回结果。
SQL 语句中还有一项是 COUNT(DISTINCT color) color_count
用于计算每个 model 的颜色数,在 Elasticsearch 中我们需要使用一个指标类聚合 Cardinality ,进行不同值计数。语句如下:
GET cars/_search { "size": 0, "aggs": { "models": { "terms": { "field": "model.keyword" }, "aggs": { "color_count": { "cardinality": { "field": "color.keyword" } } } } } }
其返回结果如下:
{ "took": 74, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "A", "doc_count": 5, "color_count": { "value": 4 } }, { "key": "C", "doc_count": 5, "color_count": { "value": 5 } }, { "key": "B", "doc_count": 2, "color_count": { "value": 2 } }, { "key": "D", "doc_count": 1, "color_count": { "value": 1 } } ] } } }
结果中 color_count 即为每个 model 的颜色数,但这里所有的模型都返回了,我们只想要颜色数大于1的模型,因此这里还要加一个过滤条件。
4. Having Condition VS Bucket Filter Aggregation
Having color_count > 1 在 Elasticsearch 中对应的是 Bucket Filter 聚合,语句如下所示:
GET cars/_search { "size": 0, "aggs": { "models": { "terms": { "field": "model.keyword" }, "aggs": { "color_count": { "cardinality": { "field": "color.keyword" } }, "color_count_filter": { "bucket_selector": { "buckets_path": { "colorCount": "color_count" }, "script": "params.colorCount>1" } } } } } }
返回结果如下:
{ "took": 39, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "A", "doc_count": 5, "color_count": { "value": 4 } }, { "key": "C", "doc_count": 5, "color_count": { "value": 5 } }, { "key": "B", "doc_count": 2, "color_count": { "value": 2 } } ] } } }
此时返回结果只包含颜色数大于1的模型,但大家会发现颜色数多的 C 不是在第一个位置,我们还需要做排序处理。
5. Order By Limit VS Bucket Sort Aggregation
ORDER BY color_count desc LIMIT 3 在 Elasticsearch 中可以使用 Bucket Sort 聚合实现,语句如下所示:
GET cars/_search { "size": 0, "aggs": { "models": { "terms": { "field": "model.keyword" }, "aggs": { "color_count": { "cardinality": { "field": "color.keyword" } }, "color_count_filter": { "bucket_selector": { "buckets_path": { "colorCount": "color_count" }, "script": "params.colorCount>1" } }, "color_count_sort": { "bucket_sort": { "sort": { "color_count": "desc" }, "size": 2 } } } } } }
返回结果如下:
{ "took": 32, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "C", "doc_count": 5, "color_count": { "value": 5 } }, { "key": "A", "doc_count": 5, "color_count": { "value": 4 } } ] } } }
至此我们便将 SQL 语句实现的功能用 Elasticsearch 查询语句实现了。对比 SQL 语句与 Elasticsearch 的查询语句,大家会发现后者复杂了很多,但并非无章可循,随着大家对常见语法越来越熟悉,相信一定会越写越得心应手!
相关推荐
另外一部分,则需要先做聚类、分类处理,将聚合出的分类结果存入ES集群的聚类索引中。数据处理层的聚合结果存入ES中的指定索引,同时将每个聚合主题相关的数据存入每个document下面的某个field下。