Elasticsearch seemingly random scoring and matching

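I am using a bool search to match multiple fields. The fields have been analyzed at index time with multiple filters, but mainly with edge_ngram.

The problem I'm having is that the scoring seems to be up in the air. I would expect my search for savvas to match the records whose first_name is Savvas first, but they are scored much lower. For example, searching for savvas returns, in order of score: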

First name | Last name       | Email
___________|_________________|________________________
------     | Sav---          | ---@sa-------------.com
-----s     | Sa----          | sa----------s@-----.com
Sa----     | ----            | sa---------@-------.com  
Sa----     | --------        | sa-------@---------.com
sa-        | -----           | sa----------@------.com
Sa--       | ----s-----s     | sa------s-----s@---.com
Sa----     | -----------     | sa-----@-----------.com
Savvas     | -------s        | ----------@--------.com
Savvas     | -------s        | --------@----------.com
Sa-        | ---s----S------ | sa------s-----@----.com

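I have replaced the characters in the fields that fall outside the edge n-grams of the search term with -, and modified the length of the emails to protect identities.

In fact, searching for ssssssssssssssss, which does not exist in my data, returns the items containing the most s characters. I would not expect this to happen, because I am not applying any manual ngram to my search.

The same issue occurs when I search for phone numbers: a search for 782 ranks any email containing the characters 78 ahead of phone numbers that have 782 as an exact ngram.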

It seems that Elasticsearch is applying ngrams to my search query as well, not just to the fields, then comparing the two and somehow favouring shorter matches.
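The suspicion above can be checked with a small simulation. The sketch below is not Elasticsearch code: it approximates the `name` analyzer with a whitespace split, lowercasing, and an edge n-gram step (the standard tokenizer and asciifolding are left out), and the names `Sam` and `Smith` are made-up sample values. It shows which terms the query and a field value share when the same analyzer runs on both sides.

```python
def edge_ngrams(token, min_gram=1, max_gram=50):
    """Return the prefixes of a token, like the `autocomplete` edge_ngram filter."""
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def name_analyzer(text):
    """Rough stand-in for the `name` analyzer: split, lowercase, edge n-gram."""
    grams = set()
    for token in text.lower().split():
        grams |= edge_ngrams(token)
    return grams


def shared_terms(query, field_value):
    """Terms produced on both sides when one analyzer runs at index and search time."""
    return name_analyzer(query) & name_analyzer(field_value)


print(sorted(shared_terms("savvas", "Sam")))                # ['s', 'sa']
print(sorted(shared_terms("ssssssssssssssss", "Smith")))    # ['s']
print(sorted(shared_terms("782", "78")))                    # ['7', '78']
```

If the query really is analyzed this way, every document with a token starting with `s` matches `savvas` on at least one term, which would be consistent with the results listed above.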

Here is my query:

{
    'bool': {
        'should': [ // Any one of these matches will return a result
            {
                'match': {
                    'phone': {
                        'query': $searchString,
                        'fuzziness': '0',
                        'boost': 3 // If phone matches give it precedence
                    }
                }
            },
            {
                'match': {
                    'email': {
                        'query': $searchString,
                        'fuzziness': '0'
                    }
                }
            },
            {
                'multi_match': {
                    'query': $searchString,
                    'type': 'cross_fields', // Match if any term is in any of the fields
                    'fields': ['name.first_name', 'name.last_name'],
                    'fuzziness': '0'
                }
            }
        ],
        'minimum_should_match': 1
    }
}

And the index settings that go with it (apologies for the verbosity, but I don't want to leave out anything that might be important):

{
    "settings":{
        "analysis":{
            "char_filter":{
                "trim":{
                    "type":"pattern_replace",
                    "pattern":"^\\s*(.*)\\s*$",
                    "replacement":"$1"
                },
                "tel_strip_chars":{
                    "type":"pattern_replace",
                    "pattern":"^(\\(\\d+\\))|^(\\+)|\\D",
                    "replacement":"$1$2"
                },
                "tel_uk_exit_coded":{
                    "type":"pattern_replace",
                    "pattern":"^00(\\d+)",
                    "replacement":"+$1"
                },
                "tel_parenthesized_country_code":{
                    "type":"pattern_replace",
                    "pattern":"^\\((\\d+)\\)(\\d+)",
                    "replacement":"+$1$2"
                }
            },
            "tokenizer":{
                "intl_tel_country_code": {
                    "type":"pattern",
                    "pattern":"\\+(9[976]\\d|8[987530]\\d|6[987]\\d|5[90]\\d|42\\d|3[875]\\d|2[98654321]\\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|4[987654310]|3[9643210]|2[70]|7|1)(\\d{1,14})$",
                    "group":0
                }
            },
            "filter":{
                "autocomplete":{
                    "type":"edge_ngram",
                    "min_gram":1,
                    "max_gram":50
                },
                "autocomplete_tel":{
                    "type":"ngram",
                    "min_gram":3,
                    "max_gram":20
                },
                "email":{
                    "type":"pattern_capture",
                    "preserve_original":1,
                    "patterns":[
                        "([^@]+)",
                        "(\\p{L}+)",
                        "(\\d+)",
                        "@(.+)",
                        "([^-@]+)"
                    ]
                }
            },
            "analyzer":{
                "name":{
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "trim",
                        "lowercase",
                        "asciifolding",
                        "autocomplete"
                    ]
                },
                "email":{
                    "type":"custom",
                    "tokenizer":"uax_url_email",
                    "filter":[
                        "trim",
                        "lowercase",
                        "email",
                        "unique",
                        "autocomplete"
                    ]
                },
                "phone":{
                    "type":"custom",
                    "tokenizer":"intl_tel_country_code",
                    "char_filter":[
                        "trim",
                        "tel_strip_chars",
                        "tel_uk_exit_coded",
                        "tel_parenthesized_country_code"
                    ],
                    "filter":[
                        "autocomplete_tel"
                    ]
                }
            }
        }
    },
    "mappings":{
        "person":{
            "properties":{
                "address":{
                    "properties":{
                        "country":{
                            "type":"string",
                            "index_name":"country"
                        }
                    }
                },
                "timezone":{
                    "type":"string"
                },
                "name":{
                    "properties":{
                        "first_name":{
                            "type":"string",
                            "analyzer":"name"
                        },
                        "last_name":{
                            "type":"string",
                            "analyzer":"name"
                        }
                    }
                },
                "email":{
                    "type":"string",
                    "analyzer":"email"
                },
                "phone":{
                    "type":"string",
                    "analyzer":"phone"
                },
                "id":{
                    "type":"string"
                }
            }
        }
    }
}

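I have tested the index settings using the Kopf plugin's analyzer, and it seems to create the correct tokens.

Ideally, I would match only the exact tokens created by my index, and prioritise a more exact match in one of my bool should clauses over matches spread across several should clauses.

However, I would be happy if it at least matched only the exact tokens. I can't use a term search, because my search string itself needs to be tokenized, just without any ngrams applied to it.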

To sum up my requirements:

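1. Score by the largest possible match in any single field.
2. Then score by the lowest possible match offset in any single field.
3. Then score by the number of fields matched, prioritising matches at lower offsets.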

--- Update: ---

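I am getting better results using dis_max. It seems to successfully rank larger ngram matches above multiple smaller ngram matches, except for the phone field, which is still hard to query. Here is the new query: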

{
    'dis_max': {
        'tie_breaker': 0.0,
        'boost': 1.5,
        'queries': [ // Any one of these matches will return a result
            {
                'match': {
                    'phone': {
                        'query': $searchString,
                        'boost': 1.9
                    }
                }
            },
            {
                'match': {
                    'email': {
                        'query': $searchString
                    }
                }
            },
            {
                'multi_match': {
                    'query': $searchString,
                    'type': 'cross_fields', // Match if any term is in any of the fields
                    'fields': ['name.first_name', 'name.last_name'],
                    'tie_breaker': 0.1,
                    'boost': 1.5
                }
            }
        ]
    }
}
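To illustrate why dis_max can behave differently from bool should here, the following sketch compares the two ways of combining per-clause scores. It is a simplification: real bool scoring in older Elasticsearch versions also involves factors such as coord and query normalization, which are ignored, and the clause scores are invented numbers rather than output from the index above.

```python
def bool_should_score(clause_scores):
    """Simplified bool/should combination: add up the scores of matching clauses."""
    return sum(s for s in clause_scores if s > 0)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """dis_max combination: best clause score plus tie_breaker times the others."""
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Hypothetical per-clause scores (phone, email, name) for two documents.
weak_everywhere = [0.25, 0.25, 0.25]
strong_in_one = [0.0, 0.0, 0.5]

print(bool_should_score(weak_everywhere), bool_should_score(strong_in_one))  # 0.75 0.5
print(dis_max_score(weak_everywhere), dis_max_score(strong_in_one))          # 0.25 0.5
```

With a `tie_breaker` of 0.0, a document that matches weakly in several fields no longer outranks one that matches strongly in a single field.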

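1 Answer

You probably do not want to apply autocomplete, i.e. the name analyzer, to your search string, only during indexing. In other words, the mapping should be: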

"first_name": {
    "type":"string",
    "index_analyzer":"name"
}
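For readers on Elasticsearch 2.0 or later, where `index_analyzer` is no longer available, the same idea is usually expressed with `analyzer` plus `search_analyzer`. The sketch below builds such a configuration as a Python dict and prints it as JSON. The `name_search` analyzer is a hypothetical addition, not part of the question's settings; it mirrors `name` without the `autocomplete` step. The `string` type is kept to match the question's mapping, although newer versions use `text`.

```python
import json

# Filter chain of the `name` analyzer from the question.
name_filters = ["trim", "lowercase", "asciifolding", "autocomplete"]
# Hypothetical search-time chain: identical, minus the edge n-gram filter.
search_filters = [f for f in name_filters if f != "autocomplete"]

analysis = {
    "filter": {
        "autocomplete": {"type": "edge_ngram", "min_gram": 1, "max_gram": 50}
    },
    "analyzer": {
        "name": {"type": "custom", "tokenizer": "standard", "filter": name_filters},
        "name_search": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": search_filters,
        },
    },
}

# `analyzer` is used at index time; `search_analyzer` overrides it for queries.
first_name_mapping = {
    "type": "string",
    "analyzer": "name",
    "search_analyzer": "name_search",
}

print(json.dumps({"analysis": analysis, "first_name": first_name_mapping}, indent=2))
```

Keeping lowercase and asciifolding on the search side means the query string is normalized the same way as the indexed prefixes, so `Savvas` and `savvas` would still hit the same terms.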

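Also, to score matches on first_name higher than matches on last_name in the multi_match, you can provide a field-level boost, as shown below.

Example: a last_name match is half as relevant as a first_name match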

{
    'dis_max': {
        'tie_breaker': 0.0,
        'boost': 1.5,
        'queries': [ // Any one of these matches will return a result
            {
                'match': {
                    'phone': {
                        'query': $searchString,
                        'boost': 1.9
                    }
                }
            },
            {
                'match': {
                    'email': {
                        'query': $searchString
                    }
                }
            },
            {
                'multi_match': {
                    'query': $searchString,
                    'type': 'cross_fields', // Match if any term is in any of the fields
                    'fields': ['name.first_name', 'name.last_name^0.5'],
                    'tie_breaker': 0.1,
                    'boost': 1.5
                }
            }
        ]
    }
}

Answered on 2024-04-29T22:27:02+08:00
