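I'm trying to run some Elasticsearch queries against a data set I have. I have a user document that is the parent of many child page_view documents. I want to return all users who have viewed a specific page a given number of times (defined by a user input box). So far I have a has_child query that returns every user with a page view matching certain ids. However, that returns those parents along with all of their children. Next I tried writing an aggregation on those query results, which essentially performs the same has_child query in aggregation form. Now I have the correct document count for the filtered child documents. I need to use that document count to go back and filter the parents. To put the query into words: "return all users who have viewed a specific page more than 4 times." I may need to restructure my data. Any ideas?
Here is the query I have so far: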
curl -XGET 'http://localhost:9200/development_users/_search?pretty=true' -d '
{
  "query": {
    "has_child": {
      "type": "page_view",
      "query": {
        "terms": { "viewed_id": [175, 180] }
      }
    }
  },
  "aggs": {
    "to_page_view": {
      "children": { "type": "page_view" },
      "aggs": {
        "page_views_that_match": {
          "filter": {
            "terms": { "viewed_id": [175, 180] }
          }
        }
      }
    }
  }
}'
This gives me a response like:
{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "development_users",
        "_type": "user",
        "_id": "22548",
        "_score": 1.0,
        "_source": { "id": 22548, "account_id": 1009 }
      }
    ]
  },
  "aggregations": {
    "to_page_view": {
      "doc_count": 53,
      "page_views_that_match": { "doc_count": 2 }
    }
  }
}
Relevant mappings:
{
  "development_users": {
    "mappings": {
      "page_view": {
        "dynamic": "false",
        "_parent": { "type": "user" },
        "_routing": { "required": true },
        "properties": {
          "created_at": { "type": "date", "format": "date_time" },
          "id": { "type": "integer" },
          "viewed_id": { "type": "integer" },
          "time_on_page": { "type": "integer" },
          "title": { "type": "string" },
          "type": { "type": "string" },
          "updated_at": { "type": "date", "format": "date_time" },
          "url": { "type": "string" }
        }
      },
      "user": {
        "dynamic": "false",
        "properties": {
          "account_id": { "type": "integer" },
          "id": { "type": "integer" }
        }
      }
    }
  }
}
Sloan Ahrens.. 6
OK, so this one is somewhat involved. I made a few simplifications to keep it straight in my head. First, I used this mapping:
PUT /test_index
{
  "mappings": {
    "page_view": {
      "_parent": { "type": "development_user" },
      "properties": {
        "viewed_id": { "type": "string" }
      }
    },
    "development_user": {
      "properties": {
        "id": { "type": "string" }
      }
    }
  }
}
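Then I added some data. In this little universe I have three users and two pages. I want to find the users who have viewed "page_a" at least twice, so if I construct the query correctly, only user 3 will be returned.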
POST /test_index/development_user/_bulk
{"index":{"_type":"development_user","_id":1}}
{"id":"user_1"}
{"index":{"_type":"page_view","_parent":1}}
{"viewed_id":"page_a"}
{"index":{"_type":"development_user","_id":2}}
{"id":"user_2"}
{"index":{"_type":"page_view","_parent":2}}
{"viewed_id":"page_b"}
{"index":{"_type":"development_user","_id":3}}
{"id":"user_3"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_b"}
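As a quick sanity check on the expected answer, independent of Elasticsearch, the same sample data can be counted in plain Python. This is only an illustrative sketch: the page_views list and the users_with_min_views helper are names made up here, and the pairs simply mirror the bulk request above.

```python
from collections import Counter

# (parent user _id, viewed_id) pairs mirroring the bulk request above.
page_views = [
    (1, "page_a"),
    (2, "page_b"),
    (3, "page_a"),
    (3, "page_a"),
    (3, "page_b"),
]


def users_with_min_views(views, page_ids, min_count):
    """Return the sorted parent ids with at least min_count views of any page in page_ids."""
    # Count only the child records whose viewed_id is one of the requested pages.
    counts = Counter(parent for parent, viewed in views if viewed in page_ids)
    # Keep the parents whose matching-child count reaches the threshold.
    return sorted(parent for parent, n in counts.items() if n >= min_count)


print(users_with_min_views(page_views, {"page_a"}, 2))  # prints [3]
```

Users 1 and 3 have both viewed "page_a", but only user 3 did so twice, which is the result the Elasticsearch query needs to reproduce.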
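To get the answer, we'll use aggregations. Note that I don't want documents returned in the normal way (hence "size": 0), but I do want to narrow down the documents we analyze, since that makes things more efficient. So I kept the basic has_child filter you had before.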
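So the aggregation tree starts with terms_parent_id, which just separates out the parent documents, one bucket per user. Inside that I have children_page_view, which moves to the child documents and, through its nested filter_page_ids filter, narrows them down to the ones I want ("page_a"). Next to it in the hierarchy is bucket_selector_page_id_term_count, which uses a bucket selector (you'll need ES 2.x for that) to keep only the parent buckets that meet the criterion, and finally there is a top hits aggregation just to show us the documents that meet the requirements.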
POST /test_index/development_user/_search
{
  "size": 0,
  "query": {
    "has_child": {
      "type": "page_view",
      "query": {
        "terms": { "viewed_id": ["page_a"] }
      }
    }
  },
  "aggs": {
    "terms_parent_id": {
      "terms": { "field": "id" },
      "aggs": {
        "children_page_view": {
          "children": { "type": "page_view" },
          "aggs": {
            "filter_page_ids": {
              "filter": {
                "terms": { "viewed_id": ["page_a"] }
              }
            }
          }
        },
        "bucket_selector_page_id_term_count": {
          "bucket_selector": {
            "buckets_path": {
              "children_count": "children_page_view>filter_page_ids._count"
            },
            "script": "children_count >= 2"
          }
        },
        "top_hits_users": {
          "top_hits": {
            "_source": { "include": ["id"] }
          }
        }
      }
    }
  }
}
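Since the page ids and the minimum count come from user input in the question, the request body could be generated rather than hand-written. Below is a minimal sketch that builds the same body as a Python dict; build_min_views_query and its parameter names are hypothetical, and the result would still need to be sent to the _search endpoint with whatever HTTP client you use.

```python
import json


def build_min_views_query(page_ids, min_count, child_type="page_view", parent_key_field="id"):
    """Build a search body selecting parents with >= min_count children matching page_ids."""
    # The same terms filter is used in the has_child query and in the aggregation.
    child_filter = {"terms": {"viewed_id": list(page_ids)}}
    return {
        "size": 0,  # no regular hits, only aggregations
        "query": {"has_child": {"type": child_type, "query": child_filter}},
        "aggs": {
            "terms_parent_id": {
                "terms": {"field": parent_key_field},
                "aggs": {
                    "children_page_view": {
                        "children": {"type": child_type},
                        "aggs": {"filter_page_ids": {"filter": child_filter}},
                    },
                    "bucket_selector_page_id_term_count": {
                        "bucket_selector": {
                            "buckets_path": {
                                "children_count": "children_page_view>filter_page_ids._count"
                            },
                            # int() keeps arbitrary user input out of the script string.
                            "script": "children_count >= %d" % int(min_count),
                        }
                    },
                    "top_hits_users": {
                        "top_hits": {"_source": {"include": [parent_key_field]}}
                    },
                },
            }
        },
    }


body = build_min_views_query(["page_a"], 2)
print(json.dumps(body, indent=2))
```

One thing to keep in mind with this shape of query: the terms aggregation only returns a limited number of buckets by default, so with a real user base the "terms" block would probably need an explicit "size" as well.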
which returns:
{
  "took": 14,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 2, "max_score": 0, "hits": [] },
  "aggregations": {
    "terms_parent_id": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "user_3",
          "doc_count": 1,
          "children_page_view": {
            "doc_count": 3,
            "filter_page_ids": { "doc_count": 2 }
          },
          "top_hits_users": {
            "hits": {
              "total": 1,
              "max_score": 1,
              "hits": [
                {
                  "_index": "test_index",
                  "_type": "development_user",
                  "_id": "3",
                  "_score": 1,
                  "_source": { "id": "user_3" }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
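Note that "hits.total" is 2 here because the has_child query matches both user 1 and user 3; the users you actually want are in the aggregation buckets. A small sketch of pulling them out of the parsed response follows, assuming the aggregation names used above (matching_user_ids is a made-up helper, and the response dict is a trimmed copy of the output shown above).

```python
def matching_user_ids(response):
    """Collect the _id of every parent document left after the bucket selector."""
    buckets = response["aggregations"]["terms_parent_id"]["buckets"]
    return [
        hit["_id"]
        for bucket in buckets
        for hit in bucket["top_hits_users"]["hits"]["hits"]
    ]


# Trimmed copy of the response shown above.
response = {
    "hits": {"total": 2, "max_score": 0, "hits": []},
    "aggregations": {
        "terms_parent_id": {
            "buckets": [
                {
                    "key": "user_3",
                    "doc_count": 1,
                    "children_page_view": {
                        "doc_count": 3,
                        "filter_page_ids": {"doc_count": 2},
                    },
                    "top_hits_users": {
                        "hits": {
                            "total": 1,
                            "hits": [
                                {"_id": "3", "_source": {"id": "user_3"}}
                            ],
                        }
                    },
                }
            ]
        }
    },
}

print(matching_user_ids(response))  # prints ['3']
```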
Here is all the code I used:
http://sense.qbox.io/gist/43f24461448519dc884039db40ebd8e2f5b7304f