
Elasticsearch multi-field fuzzy search does not return exact matches first


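I am running a fuzzy Elasticsearch query against the 'text' and 'keywords' fields. I have two documents in Elasticsearch: one whose "text" is "testPhone 5" and another whose "text" is "testPhone 4s". When I run a fuzzy query for "testPhone 5", both documents get exactly the same score. Why does this happen?

Additional info: I index the documents with the 'uax_url_email' tokenizer and the 'lowercase' filter.

This is the query I am running: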

{
    query : {
        bool: {
            // match one or the other fuzzy query
            should: [
                {
                    fuzzy: {
                        text: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 5,
                        }
                    }
                },
                {
                    fuzzy: {
                        keywords: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 1,
                        }
                    }
                }
            ]
        }
    },
    sort: [ 
        '_score'
    ],
    explain: true
}

These are the results:

{ max_score: 0.47213298,
  total: 2,
  hits:
  [ { _index: 'test',
     _shard: 0,
     _id: '51fbf95f82e89ae8c300002c',
     _node: '0Mtfzbe1RDinU71Ordx-Ag',
     _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002c',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 5',
      keywords: [ [length]: 0 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 { _index: 'test',
   _shard: 4,
   _id: '51fbf95f82e89ae8c300002d',
   _node: '0Mtfzbe1RDinU71Ordx-Ag',
   _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002d',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 4s',
      keywords: [ 'apple', [length]: 1 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 [length]: 2 ] }

2 Answers

  • 2

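    Fuzzy queries are not analyzed, but the field is. Your search for testphone 5 with a distance of 0.4 therefore matches the analyzed term testphone in both documents, and that term is what the results are scored on:

    description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },

    See also @imotov's excellent answer: ElasticSearch's Fuzzy Query

    You can use the _analyze API to see how your string is tokenized: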

    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

    http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5

    This will return:

    {
       "tokens": [
          {
             "token": "testphone",
             "start_offset": 0,
             "end_offset": 9,
             "type": "<ALPHANUM>",
             "position": 1
          },
          {
             "token": "5",
             "start_offset": 10,
             "end_offset": 11,
             "type": "<NUM>",
             "position": 2
          }
       ]
    }
    

    So even if you indexed the value testphone sammsung, a fuzzy query for "testphone samsunk" would not return anything, whereas a fuzzy query for just samsunk would.
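
    As a rough check of the explanation in this answer, the sketch below compares whole query strings against individual indexed tokens using a plain Levenshtein distance. The `levenshtein` and `fuzzyCandidates` helpers, the token lists, and the cap of 2 edits are illustrative assumptions, not Elasticsearch internals. The boost formula in the last lines is also an assumption; it is included only because it reproduces the 3.8888888 shown in the explain output.

    ```javascript
    // Plain Levenshtein edit distance (insert, delete, substitute), two-row DP.
    function levenshtein(a, b) {
      let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
      for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
          const cost = a[i - 1] === b[j - 1] ? 0 : 1;
          curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
      }
      return prev[b.length];
    }

    // Assumption: at most 2 edits are allowed between the query term and a token.
    const MAX_EDITS = 2;

    // Returns the indexed tokens that the whole (unanalyzed) query term can reach.
    function fuzzyCandidates(queryTerm, tokens) {
      return tokens.filter((t) => levenshtein(queryTerm, t) <= MAX_EDITS);
    }

    // Tokens of the two documents from the question.
    console.log(fuzzyCandidates("testphone 5", ["testphone", "5", "4s"]));
    // The whole string versus the single word from this answer.
    console.log(fuzzyCandidates("testphone samsunk", ["testphone", "sammsung"]));
    console.log(fuzzyCandidates("samsunk", ["testphone", "sammsung"]));

    // Assumed scaling: boost * (1 - edits / shorter length) = 5 * (1 - 2/9).
    const edits = levenshtein("testphone 5", "testphone");
    const shorter = Math.min("testphone 5".length, "testphone".length);
    const scaledBoost = 5 * (1 - edits / shorter);
    console.log(scaledBoost);
    ```

    Under these assumptions the only token within reach of "testphone 5" is testphone, which both documents contain, so neither document is favoured.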

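    You may get better results by not analyzing the field (or by using the keyword analyzer).

    If you want to analyze a single field in several different ways, you can use the multi_field construct.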

    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
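
    A minimal sketch of such a mapping, written as JavaScript objects in the style of the question. The type name productgroup and the field text come from the question; the sub-field name `untouched` is a hypothetical choice, and the exact mapping syntax should be checked against the Elasticsearch version in use.

    ```javascript
    // Mapping sketch: one source field indexed two ways via multi_field.
    const mapping = {
      productgroup: {
        properties: {
          text: {
            type: "multi_field",
            fields: {
              // Default sub-field: analyzed as before, queried as "text".
              text: { type: "string" },
              // Hypothetical sub-field: whole value kept as a single term,
              // queried as "text.untouched".
              untouched: { type: "string", index: "not_analyzed" }
            }
          }
        }
      }
    };

    // Fuzzy query aimed at the single-term sub-field instead of the analyzed one.
    const query = {
      query: {
        fuzzy: {
          "text.untouched": { value: "testPhone 5", min_similarity: 0.4 }
        }
      }
    };

    console.log(JSON.stringify({ mapping, query }, null, 2));
    ```

    A not_analyzed field is not lowercased, so the stored term keeps its original case; the query value here is written as "testPhone 5" for that reason.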

  • 0

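    I ran into this problem myself recently. I can't tell you exactly why it happens, but I can tell you how I fixed it:

    I ran two queries on the same field: one with exact matching, and then the very same query on the same field with fuzzy matching enabled and a lower boost.

    This ensured that my exact matches always ranked higher than the fuzzy matches.

    P.S. I think the scores are equal because, with fuzziness enabled, both documents match, and ES does not care that one of them is an exact match as long as both match. This is pure speculation on my part, though, as I am not very familiar with the scoring algorithm.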
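
    The two-query approach described in this answer might look roughly like the sketch below. The answer does not say which query type was used, so the use of `match` queries is an assumption, as are the helper name `exactThenFuzzy` and the boost values; the values accepted by `fuzziness` depend on the Elasticsearch version.

    ```javascript
    // Builds a bool query whose non-fuzzy clause is boosted above its fuzzy clause.
    // exactBoost, fuzzyBoost and fuzziness are illustrative values, not tuned ones.
    function exactThenFuzzy(field, text, options = {}) {
      const { exactBoost = 10, fuzzyBoost = 1, fuzziness = 0.4 } = options;
      return {
        query: {
          bool: {
            should: [
              // Non-fuzzy clause: matches the analyzed query terms exactly.
              { match: { [field]: { query: text, boost: exactBoost } } },
              // Fuzzy clause: the same query with fuzziness enabled and a lower boost.
              { match: { [field]: { query: text, fuzziness: fuzziness, boost: fuzzyBoost } } }
            ]
          }
        }
      };
    }

    const body = exactThenFuzzy("text", "testphone 5");
    console.log(JSON.stringify(body, null, 2));
    ```

    With both clauses under should, a document that satisfies the non-fuzzy clause collects its higher boost on top of the fuzzy score, which is the ordering the answer relies on.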
