Elasticsearch multi-field fuzzy search does not return the exact match first

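I am running a fuzzy Elasticsearch query against the 'text' and 'keywords' fields. I have two documents in Elasticsearch, one whose 'text' is "testPhone 5" and another whose 'text' is "testPhone 4s". When I run a fuzzy query for "testPhone 5", both documents are given exactly the same score. Why does this happen?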

Extra info: I index the documents with the 'uax_url_email' tokenizer and the 'lowercase' filter.

Here is the query I am running:

{
    query : {
        bool: {
            // match one or the other fuzzy query
            should: [
                {
                    fuzzy: {
                        text: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 5,
                        }
                    }
                },
                {
                    fuzzy: {
                        keywords: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 1,
                        }
                    }
                }
            ]
        }
    },
    sort: [ 
        '_score'
    ],
    explain: true
}

Here are the results:

{ max_score: 0.47213298,
  total: 2,
  hits:
  [ { _index: 'test',
     _shard: 0,
     _id: '51fbf95f82e89ae8c300002c',
     _node: '0Mtfzbe1RDinU71Ordx-Ag',
     _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002c',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 5',
      keywords: [ [length]: 0 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 { _index: 'test',
   _shard: 4,
   _id: '51fbf95f82e89ae8c300002d',
   _node: '0Mtfzbe1RDinU71Ordx-Ag',
   _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002d',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 4s',
      keywords: [ 'apple', [length]: 1 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 [length]: 2 ] }

2 Answers

2
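Fuzzy queries are not analyzed, but the field is. Your search for testphone 5 with a minimum similarity of 0.4 therefore ends up matching the analyzed term testphone, which is indexed for both documents, and that single term is what gets scored:

description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },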
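
The tie can be checked by re-deriving the score from the factors in the explain output. This is only a sanity check, not a description of Lucene's internals. It assumes the fuzzy boost is the query's boost of 5 scaled by `1 - edits / min(length)`, an assumption that happens to reproduce the `3.8888888` shown above, and that `testphone 5` is two edits away from the indexed term `testphone`. Every other number is copied from the explain output.

```javascript
// Factors copied from the explain output in the question.
const idf = 1.5108256;        // idf(docFreq=2, maxDocs=5)
const queryNorm = 0.17020021; // queryNorm
const fieldNorm = 0.625;      // fieldNorm(doc=0)
const tf = 1;                 // tf(freq=1.0)
const coord = 0.5;            // coord(1/2): only one of the two should clauses matched

// Assumption: similarity = 1 - edits / min(len(query term), len(indexed term)).
const queryTerm = 'testphone 5';
const indexedTerm = 'testphone';
const edits = 2; // drop the space and the '5'
const similarity = 1 - edits / Math.min(queryTerm.length, indexedTerm.length);
const boost = 5 * similarity; // the query's boost of 5, scaled by the similarity

const queryWeight = boost * idf * queryNorm;
const fieldWeight = tf * idf * fieldNorm;
const score = queryWeight * fieldWeight * coord;

console.log(boost.toFixed(4), score.toFixed(4));
```

None of these factors depends on the `5` or `4s` tokens. Both documents contain the term `testphone` once in a two-token `text` field, so every factor is the same and so is the score.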

See also @imotov's excellent answer: ElasticSearch's Fuzzy Query

You can use the _analyze API to see how a string is tokenized:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

For example,

http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5

returns:
```
{
   "tokens": [
      {
         "token": "testphone",
         "start_offset": 0,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "5",
         "start_offset": 10,
         "end_offset": 11,
         "type": "<NUM>",
         "position": 2
      }
   ]
}
```
So even if you index the value testphone sammsung, a fuzzy query for "testphone samsunk" will not match anything, whereas a fuzzy query for samsunk alone would.
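
To make the term-level behaviour concrete, here is a small simulation. It is a sketch under stated assumptions, not Elasticsearch code: `analyze` is a simplified stand-in for the real analyzer (lowercase, split on whitespace), `fuzzyMatches` is a hypothetical helper, and the limit of 2 edits is an assumed cutoff.

```javascript
// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Simplified stand-in for the index-time analyzer (assumption).
function analyze(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Hypothetical helper: the query string is NOT analyzed, so it is compared
// as one whole term against each indexed token.
function fuzzyMatches(query, indexedText, maxEdits = 2) {
  return analyze(indexedText).filter((token) => levenshtein(query, token) <= maxEdits);
}

console.log(fuzzyMatches('testphone samsunk', 'testphone sammsung')); // no token is close enough
console.log(fuzzyMatches('samsunk', 'testphone sammsung'));           // 'sammsung' is 2 edits away
```

The same helper applied to the question's data, `fuzzyMatches('testphone 5', 'testPhone 4s')`, matches the token `testphone`, which is why the "4s" document is returned at all.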

You may get better results by not analyzing the field (or by using the keyword analyzer).

If you want a single field to be analyzed in several different ways, you can use the multi_field construct.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
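
A multi_field mapping for the `text` field might look roughly like the sketch below, written as a JavaScript object to match the question. The sub-field name `untouched` is illustrative and not taken from the question. The syntax follows the older Elasticsearch versions that still have the `multi_field` type, so check it against the version you run.

```javascript
// Sketch of a multi_field mapping for the `productgroup` type.
const mapping = {
  productgroup: {
    properties: {
      text: {
        type: 'multi_field',
        fields: {
          // Default sub-field: analyzed as before, reachable as `text`.
          text: { type: 'string', index: 'analyzed' },
          // Illustrative sub-field: the whole value is indexed as a single term.
          untouched: { type: 'string', index: 'not_analyzed' }
        }
      }
    }
  }
};

console.log(JSON.stringify(mapping, null, 2));
```

A fuzzy query could then target `text.untouched`, where "testPhone 5" and "testPhone 4s" are distinct terms. A not_analyzed sub-field keeps the original casing, so the query value would have to match it.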
Answered 2024-05-05T13:05:11+08:00
0

I ran into this problem myself recently. I can't tell you exactly why it happens, but I can tell you how I fixed it:

I ran two queries on the same field: one exact match, and then the very same query on the same field with fuzzy matching enabled and a lower boost.

This ensured that my exact matches always ranked higher than the fuzzy matches.
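
One way this could look, in the same object-literal style as the question. It is a sketch, not the exact query from this answer: `buildQuery` is a hypothetical helper, the boost values 10 and 1 are arbitrary, and a `match` query with `operator: 'and'` is assumed as the "exact" clause.

```javascript
// Hypothetical helper: combine a strict clause and a fuzzy clause on the same field.
function buildQuery(text) {
  return {
    query: {
      bool: {
        should: [
          // Strict clause: every analyzed token of the input must be present.
          { match: { text: { query: text, operator: 'and', boost: 10 } } },
          // Lenient clause: same field, fuzzy, with a lower boost.
          { fuzzy: { text: { value: text, min_similarity: 0.4, prefix_length: 0, boost: 1 } } }
        ]
      }
    }
  };
}

console.log(JSON.stringify(buildQuery('testphone 5'), null, 2));
```

A document that satisfies both clauses adds the strict clause's score on top of the fuzzy one, so it should rank above documents that only match fuzzily.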

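P.S. I think the scores are equal because, with fuzziness enabled, both documents match, and ES does not care that one of them is an exact match as long as both match. This is pure theorizing on my part, though, since I am not very familiar with the scoring algorithm.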

Answered 2024-05-05T13:05:11+08:00
