elasticsearch中数据聚合问题
由于项目中最近用到了elasticsearch,并且用到elasticsearch的聚合(Aggregation)功能,就深入研究了一下,elasticsearch中的聚合主要有四种:Bucketing Aggregation、Metric Aggregation、Matrix Aggregation和Pipeline Aggregation。
聚合的基本结构
"aggregations" : { "<aggregation_name>" : { --用户自己起的名字 "<aggregation_type>" : { --聚合类型,如avg, sum <aggregation_body> -- 针对的字段 } [,"meta" : { [<meta_data_body>] } ]? [,"aggregations" : { [<sub_aggregation>]+ } ]? --聚合里面可以嵌套聚合 } [,"<aggregation_name_2>" : { ... } ]* }
Metric Aggregation
Avg Aggregation--计算平均值
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "avg_value": { "avg": {"field": "value"} } } }
返回结果:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 315, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "avg_value" : { "value" : 342.84761904761905 } } }
之前看到其他博客上有说search_type=count可以只返回aggregation部分的结果,但我在7.x版本中试了下,好像不行,这边只能通过将size设为0来隐藏掉除了统计数据以外的数据。
Cardinality Aggregation--去重(相当于mysql中的distinct)
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "avg_value": { "cardinality": {"field": "service_id"} } } }
返回结果:
{ "took" : 11, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 317, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "avg_value" : { "value" : 2 } } }
Extended Status Aggragation--获取某个字段的所有统计信息(包括平均值,最大/小值....)
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "avg_status": { "extended_stats": { "field": "value" } } } }
返回结果:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 326, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "avg_status" : { "count" : 326, // 数量 "min" : 2.0, // 最小值 "max" : 2481.0, // 最大值 "avg" : 347.63803680981596, // 均值 "sum" : 113330.0, // 和 "sum_of_squares" : 1.02303634E8, "variance" : 192962.62358387595, "std_deviation" : 439.275111500613, "std_deviation_bounds" : { "upper" : 1226.188259811042, "lower" : -530.91218619141 } } } }
Max Aggregation--求最大值
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "max_value": { "max": { "field": "value" } } } }
返回结果:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 352, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "max_value" : { "value" : 2481.0 } } }
Min Aggreegation--计算最小值
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "min_value": { "min": { "field": "value" } } } }
返回结果:
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 352, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "min_value" : { "value" : 2.0 } } }
Percentiles Aggregation -- 百分比统计,按照[ 1, 5, 25, 50, 75, 95, 99 ]来统计
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "value_outlier": { "percentiles": { "field": "value" } } } }
返回结果:
{ "took" : 44, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 334, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "value_outlier" : { "values" : { "1.0" : 4.0, "5.0" : 67.2, "25.0" : 91.33333333333333, "50.0" : 151.0, "75.0" : 420.0, "95.0" : 1412.4000000000005, "99.0" : 1906.32 } } } }
从返回结果可以看出来,75%的数据在420ms加载完毕。
当然我们也可以指定自己需要统计的百分比:
GET /endpoint_avg/_search { "size": 0, "aggs": { "value_outlier": { "percentiles": { "field": "value", "percents": [95, 96, 99, 99.5] } } } }
返回结果:
{ "took" : 20, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 330, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "value_outlier" : { "values" : { "95.0" : 1366.0, "96.0" : 1449.8000000000002, "99.0" : 1906.3999999999999, "99.5" : 2064.400000000004 } } } }
Percentile Ranks Aggregation -- 统计返回内数据的百分比
GET /endpoint_avg/_search { "size": 0, "aggs": { "value_range": { "percentile_ranks": { "field": "value", "values": [100, 200] } } } }
返回结果:
{ "took" : 10, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 346, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "value_range" : { "values" : { "100.0" : 32.51445086705203, "200.0" : 65.19405450041288 } } } }
从返回结果可以看出,在100ms左右加载完毕的占了32%, 200ms左右加载完毕的占了65%
Status Aggregation -- 状态统计
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "value_status": { "stats": { "field": "value" } } } }
返回结果:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 355, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "value_status" : { "count" : 355, "min" : 2.0, "max" : 2753.0, "avg" : 339.8112676056338, "sum" : 120633.0 } } }
可以发现跟之前的extended stats aggregation返回数据类似,只是少了一些较复杂的标准差之类的数据。
Sum Aggregation -- 求和函数
请求示例:
GET /endpoint_avg/_search { "size": 0, "query": {"term": { "service_id": { "value": 5 } }}, "aggs": { "sum_value": { "sum": { "field": "value" } } } }
返回结果:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 194, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "sum_value" : { "value" : 91322.0 } } }
Top Hits Aggregation -- 获取前n条数据, 可以嵌套使用
请求示例:
GET /endpoint_avg/_search { "size": 0, "aggs": { "top_tags": { "terms": { "field": "service_id", "size": 2 }, "aggs": { "top_value": { "top_hits": { "size": 3, "sort": [{ "time_bucket": {"order": "desc"} }] } } } } } }
返回结果:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 372, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "top_tags" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 5, "doc_count" : 198, "top_value" : { "hits" : { "total" : 198, "max_score" : null, "hits" : [ { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191621_25", "_score" : null, "_source" : { "service_id" : 5, "count" : 2, "time_bucket" : 201906191621, "service_instance_id" : 250, "entity_id" : "25", "value" : 149, "summation" : 299 }, "sort" : [ 201906191621 ] }, { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191620_24", "_score" : null, "_source" : { "service_id" : 5, "count" : 1, "time_bucket" : 201906191620, "service_instance_id" : 250, "entity_id" : "24", "value" : 93, "summation" : 93 }, "sort" : [ 201906191620 ] }, { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191620_37", "_score" : null, "_source" : { "service_id" : 5, "count" : 1, "time_bucket" : 201906191620, "service_instance_id" : 250, "entity_id" : "37", "value" : 122, "summation" : 122 }, "sort" : [ 201906191620 ] } ] } } }, { "key" : 3, "doc_count" : 174, "top_value" : { "hits" : { "total" : 174, "max_score" : null, "hits" : [ { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191621_144", "_score" : null, "_source" : { "service_id" : 3, "count" : 1, "time_bucket" : 201906191621, "service_instance_id" : 238, "entity_id" : "144", "value" : 93, "summation" : 93 }, "sort" : [ 201906191621 ] }, { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191620_70", "_score" : null, "_source" : { "service_id" : 3, "count" : 1, "time_bucket" : 201906191620, "service_instance_id" : 238, "entity_id" : "70", "value" : 192, "summation" : 192 }, "sort" : [ 201906191620 ] }, { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191620_18", "_score" : null, "_source" : { "service_id" : 3, "count" : 2, "time_bucket" : 201906191620, "service_instance_id" : 238, "entity_id" : "18", "value" : 81, "summation" : 162 }, "sort" : [ 201906191620 ] } ] } } } ] } } }
Value Count Aggregation--统计不同值的数量
请求示例:
GET /endpoint_avg/_search { "size": 2, "aggs": { "value_count": { "value_count": { "field": "value" } } } }
返回结果:
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 357, "max_score" : 1.0, "hits" : [ { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191457_16", "_score" : 1.0, "_source" : { "service_id" : 3, "count" : 1, "time_bucket" : 201906191457, "service_instance_id" : 238, "entity_id" : "16", "value" : 129, "summation" : 129 } }, { "_index" : "endpoint_avg", "_type" : "type", "_id" : "201906191503_691", "_score" : 1.0, "_source" : { "service_id" : 5, "count" : 2, "time_bucket" : 201906191503, "service_instance_id" : 250, "entity_id" : "691", "value" : 178, "summation" : 357 } } ] }, "aggregations" : { "value_count" : { "value" : 357 } } }
基本metrics中常用的聚合函数就这几种,今天太累了,其他三类的聚合后续再做研究吧!
相关推荐
newbornzhao 2020-09-14
做对一件事很重要 2020-09-07
renjinlong 2020-09-03
明瞳 2020-08-19
李玉志 2020-08-19
mengyue 2020-08-07
molong0 2020-08-06
AFei00 2020-08-03
molong0 2020-08-03
wenwentana 2020-08-03
YYDU 2020-08-03
另外一部分,则需要先做聚类、分类处理,将聚合出的分类结果存入ES集群的聚类索引中。数据处理层的聚合结果存入ES中的指定索引,同时将每个聚合主题相关的数据存入每个document下面的某个field下。
sifeimeng 2020-08-03
心丨悦 2020-08-03
liangwenrong 2020-07-31
sifeimeng 2020-08-01
mengyue 2020-07-30
tigercn 2020-07-29
IceStreamLab 2020-07-29