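I'm struggling to make iPhone match when searching for iphone in Elasticsearch.
Since I have some source code at stake, I definitely need a CamelCase tokenizer, but it appears to break iPhone into two terms, so iphone can't be found.
Does anyone know a way to add exceptions to the rule that breaks camelCase words into tokens (camel + case)?
UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].
Is there any other solution?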
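UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is definitely more useful from the point of view of the person searching. Indexing is also faster! The price to pay is index size, but it is a superior solution.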
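To make the token list in UPDATE 2 concrete, here is a minimal Python sketch that reproduces it outside Elasticsearch. It is only an illustration of the idea (every contiguous run of camelCase parts, joined and lowercased); the helper names `split_camel` and `contiguous_runs` are made up for this example and say nothing about which filters Elasticsearch uses internally to get there.

```python
import re

def split_camel(word):
    # Hypothetical helper: split a camelCase word into its parts.
    # "NullPointerException" -> ["Null", "Pointer", "Exception"]
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", word)

def contiguous_runs(parts):
    # Emit every contiguous run of parts, shortest runs first,
    # joined without a separator and lowercased.
    tokens = []
    for length in range(1, len(parts) + 1):
        for start in range(len(parts) - length + 1):
            tokens.append("".join(parts[start:start + length]).lower())
    return tokens

print(contiguous_runs(split_camel("NullPointerException")))
```

A query for nullpointer or pointerexception can then match directly, which is where the extra index size comes from.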
You can meet your requirements with the word_delimiter token filter. This is my setup:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": { }
}
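This splits words on case changes, so NullPointerException will be tokenized as null, pointer, and exception, but iPhone and WiFi will remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also set preserve_original, which will help you a lot.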
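As a rough mental model of what camel_analyzer does, the following Python sketch mimics the pipeline: split on whitespace, split each word on lower-to-upper case changes unless it is protected, then lowercase. It is not Lucene's implementation; it ignores numerics, punctuation delimiters, and asciifolding, and the function name `camel_analyze` is invented for this example.

```python
import re

# Mirrors the protected_words list from the settings above.
PROTECTED_WORDS = {"iPhone", "WiFi"}

def camel_analyze(text, protected=PROTECTED_WORDS):
    tokens = []
    for word in text.split():  # stands in for the whitespace tokenizer
        if word in protected:
            # Protected words skip the case-change split entirely.
            parts = [word]
        else:
            # Insert a break at every lower-to-upper case change.
            parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()
        # Stands in for the lowercase filter, applied after splitting.
        tokens.extend(part.lower() for part in parts)
    return tokens

print(camel_analyze("iPhone"))
print(camel_analyze("NullPointerException"))
```

The ordering matters in the sketch just as in the analyzer: the protected-word check sees the original casing because lowercasing happens only after the split.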
GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer
Result:
{
  "tokens": [
    {
      "token": "iphone",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    }
  ]
}
Now with:
GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer
Result:
{
  "tokens": [
    {
      "token": "null",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "pointer",
      "start_offset": 4,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "exception",
      "start_offset": 11,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
Does this help?
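For the "analyze the field twice" alternative, one possible shape is a multi-field mapping, sketched here as a Python dict so it can be dumped to JSON. The field name `message` and sub-field name `camel` are hypothetical, and `field_type` is an assumption: recent Elasticsearch versions use "text" for analyzed strings while older ones use "string". Where this fragment sits inside the full mappings body also depends on the version.

```python
import json

def two_analyzer_mapping(field, field_type="text"):
    # field_type is an assumption: "text" on recent versions,
    # "string" on older ones.
    return {
        "properties": {
            field: {
                "type": field_type,
                "analyzer": "standard",
                # The sub-field indexes the same value a second time
                # with the camelCase-aware analyzer.
                "fields": {
                    "camel": {
                        "type": field_type,
                        "analyzer": "camel_analyzer",
                    }
                },
            }
        }
    }

# "message" is a hypothetical field name used only for illustration.
print(json.dumps(two_analyzer_mapping("message"), indent=2))
```

With a mapping like this, a query could target `message` for plain matching and `message.camel` for camelCase-aware matching.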