
Elasticsearch: filter parents by the count of filtered child documents


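I'm trying to run some Elasticsearch queries over a set of data I have. I have a user document that is the parent of many page_view child documents. I want to return all users who have viewed a specific page a given number of times (the number is defined by a user input box). So far I have a has_child query that returns all users who have page views with certain ids. However, that returns those parents along with all of their children. Next, I tried writing an aggregation on top of those query results, which essentially performs the same has_child query in aggregation form. Now I have the correct document count for the filtered child documents. I need to use that document count to go back and filter the parents. To explain the query in words: "return me all users who have viewed a specific page 4 or more times". I may need to restructure my data. Any ideas?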

Here is the query I have so far:

curl -XGET 'http://localhost:9200/development_users/_search?pretty=true' -d '
{
    "query" : { 
      "has_child" : {
        "type" : "page_view",
        "query" : {
          "terms" : {
            "viewed_id" : [175,180]
          }
        }
      }
    },
    "aggs" : {
      "to_page_view": {
        "children": {
          "type" : "page_view"
        },
        "aggs" : {
          "page_views_that_match" : {
            "filter" : { "terms": { "viewed_id" : [175,180] } }
          }
        }
      }
    }
}'

This gives me a response like:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "development_users",
      "_type" : "user",
      "_id" : "22548",
      "_score" : 1.0,
      "_source":{"id":22548,"account_id":1009}
    } ]
  },
  "aggregations" : {
    "to_page_view" : {
      "doc_count" : 53,
      "page_views_that_match" : {
        "doc_count" : 2
      }
    }
  }
}

Relevant mappings:

{
  "development_users" : {
    "mappings" : {
      "page_view" : {
        "dynamic" : "false",
        "_parent" : {
          "type" : "user"
        },
        "_routing" : {
          "required" : true
        },
        "properties" : {
          "created_at" : {
            "type" : "date",
            "format" : "date_time"
          },
          "id" : {
            "type" : "integer"
          },
          "viewed_id" : {
            "type" : "integer"
          },
          "time_on_page" : {
            "type" : "integer"
          },
          "title" : {
            "type" : "string"
          },
          "type" : {
            "type" : "string"
          },
          "updated_at" : {
            "type" : "date",
            "format" : "date_time"
          },
          "url" : {
            "type" : "string"
          }
        }
      },
      "user" : {
        "dynamic" : "false",
        "properties" : {
          "account_id" : {
            "type" : "integer"
          },
          "id" : {
            "type" : "integer"
          }
        }
      }
    }
  }
}

Sloan Ahrens.. 6

OK, so this is somewhat involved. I made some simplifications so I could keep it all in my head. First, I used this mapping:

PUT /test_index
{
    "mappings": {
        "page_view": {
            "_parent": {
               "type": "development_user"
            },
            "properties": {
                "viewed_id": {
                    "type": "string"
                }
            }
        },
        "development_user": {
            "properties": {
                "id": {
                    "type": "string"
                }
            }
        }
    }
}

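Then I added some data. In this little universe I have three users and two pages. I want to find the users who have viewed "page_a" at least twice, so if I construct the query correctly, only user 3 should be returned.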

POST /test_index/development_user/_bulk
{"index":{"_type":"development_user","_id":1}}
{"id":"user_1"}
{"index":{"_type":"page_view","_parent":1}}
{"viewed_id":"page_a"}
{"index":{"_type":"development_user","_id":2}}
{"id":"user_2"}
{"index":{"_type":"page_view","_parent":2}}
{"viewed_id":"page_b"}
{"index":{"_type":"development_user","_id":3}}
{"id":"user_3"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_b"}
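As a sanity check on the expected result, here is a minimal sketch in plain Python, independent of Elasticsearch. It counts the matching page views per parent in the sample data above. The function name `parents_with_min_views` and the tuple layout are assumptions made only for this illustration.

```python
from collections import Counter

# (parent_id, viewed_id) pairs copied from the bulk request above.
page_views = [
    (1, "page_a"),
    (2, "page_b"),
    (3, "page_a"),
    (3, "page_a"),
    (3, "page_b"),
]


def parents_with_min_views(views, page_ids, min_count):
    """Return the parent ids that have at least min_count views of page_ids."""
    # Count only the child rows whose viewed_id is one of the wanted pages.
    counts = Counter(parent for parent, viewed in views if viewed in page_ids)
    # Keep the parents whose filtered child count reaches the threshold.
    return sorted(parent for parent, n in counts.items() if n >= min_count)


print(parents_with_min_views(page_views, {"page_a"}, 2))  # prints [3]
```

User 1 has only one "page_a" view and user 2 has none, so only user 3 reaches the threshold.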

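To get the answer, we'll use aggregations. Note that I don't want documents returned the normal way (hence "size": 0), but I do want to filter down the documents being analyzed, since that makes the aggregation more efficient. So I use the same basic has_child filter you had before.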

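So the aggregation tree starts with terms_parent_id, which simply puts each parent document in its own bucket. Inside that, children_page_view switches to the child documents, and its nested filter_page_ids narrows them down to the ones I want ("page_a"). Next to it in the hierarchy, bucket_selector_page_id_term_count uses a bucket selector (you'll need ES 2.x) to keep only the parent buckets whose filtered child count meets the criterion. Finally, a top_hits aggregation shows us the documents that meet the requirements.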

POST /test_index/development_user/_search
{
   "size": 0,
   "query": {
      "has_child": {
         "type": "page_view",
         "query": {
            "terms": {
               "viewed_id": [
                  "page_a"
               ]
            }
         }
      }
   },
   "aggs": {
      "terms_parent_id": {
         "terms": {
            "field": "id"
         },
         "aggs": {
            "children_page_view": {
               "children": {
                  "type": "page_view"
               },
               "aggs": {
                  "filter_page_ids": {
                     "filter": {
                        "terms": {
                           "viewed_id": [
                              "page_a"
                           ]
                        }
                     }
                  }
               }
            },
            "bucket_selector_page_id_term_count": {
               "bucket_selector": {
                  "buckets_path": {
                     "children_count": "children_page_view>filter_page_ids._count"
                  },
                  "script": "children_count >= 2"
               }
            },
            "top_hits_users": {
               "top_hits": {
                  "_source": {
                     "include": [
                        "id"
                     ]
                  }
               }
            }
         }
      }
   }
}
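Since the page ids and the threshold come from user input in the question, the request body above could be generated rather than written by hand. The following is a minimal sketch, assuming the body is built as a plain Python dict and sent by whatever HTTP client you already use. The function name `build_parent_count_query` and its parameters are hypothetical, and the aggregation names are simply the ones from the query above.

```python
import json


def build_parent_count_query(page_ids, min_views, child_type="page_view",
                             parent_key_field="id"):
    """Build the search body shown above for given page ids and threshold."""
    # The same terms filter is used in the has_child query and in the
    # filter aggregation, exactly as in the hand-written request.
    page_filter = {"terms": {"viewed_id": list(page_ids)}}
    return {
        "size": 0,
        "query": {"has_child": {"type": child_type, "query": page_filter}},
        "aggs": {
            "terms_parent_id": {
                "terms": {"field": parent_key_field},
                "aggs": {
                    "children_page_view": {
                        "children": {"type": child_type},
                        "aggs": {"filter_page_ids": {"filter": page_filter}},
                    },
                    "bucket_selector_page_id_term_count": {
                        "bucket_selector": {
                            "buckets_path": {
                                "children_count":
                                    "children_page_view>filter_page_ids._count"
                            },
                            # int() keeps arbitrary input out of the script.
                            "script": "children_count >= %d" % int(min_views),
                        }
                    },
                    "top_hits_users": {
                        "top_hits": {"_source": {"include": [parent_key_field]}}
                    },
                },
            }
        },
    }


# The ids and threshold from the question: pages 175 and 180, 4 or more views.
body = build_parent_count_query([175, 180], 4)
print(json.dumps(body, indent=2))
```

One thing to keep in mind: a terms aggregation returns only a limited number of buckets by default (its size parameter), so with many users that limit probably needs to be raised for this approach to see every parent.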

This returns:

{
   "took": 14,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "terms_parent_id": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "user_3",
               "doc_count": 1,
               "children_page_view": {
                  "doc_count": 3,
                  "filter_page_ids": {
                     "doc_count": 2
                  }
               },
               "top_hits_users": {
                  "hits": {
                     "total": 1,
                     "max_score": 1,
                     "hits": [
                        {
                           "_index": "test_index",
                           "_type": "development_user",
                           "_id": "3",
                           "_score": 1,
                           "_source": {
                              "id": "user_3"
                           }
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
}
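Because "size" is 0, the matching users show up in the aggregation buckets rather than in hits. A small sketch of pulling them out on the client side, assuming the response has already been parsed into a Python dict. The helper name `matching_users` is hypothetical, and the sample dict is a trimmed copy of the response above.

```python
def matching_users(response, agg_name="terms_parent_id",
                   child_agg="children_page_view",
                   filter_agg="filter_page_ids"):
    """Return (bucket key, matching child count) for each surviving bucket."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b[child_agg][filter_agg]["doc_count"]) for b in buckets]


# Trimmed copy of the response above (top_hits and metadata omitted).
response = {
    "aggregations": {
        "terms_parent_id": {
            "buckets": [
                {
                    "key": "user_3",
                    "doc_count": 1,
                    "children_page_view": {
                        "doc_count": 3,
                        "filter_page_ids": {"doc_count": 2},
                    },
                }
            ]
        }
    }
}

print(matching_users(response))  # prints [('user_3', 2)]
```

Note that hits.total is 2 in the response because the has_child query alone matches every user with at least one "page_a" view (users 1 and 3). Only the buckets reflect the count threshold.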

Here is all the code I used:

http://sense.qbox.io/gist/43f24461448519dc884039db40ebd8e2f5b7304f



