Elasticsearch: Influence scoring with custom score field in document, pt. 2

I have these documents:

{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}

and

{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}

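I want to compute _score from the confidence value of each matching tag. For example, if you search for "mountain", it should return only the doc with id 2, since that is the only one carrying that tag. If you search for "landscape", doc 2 should score higher than doc 1, because the confidence for landscape is higher in 2 than in 1 (48.36 vs 33.66). If you search for "coast landscape", this time doc 1 should score higher than doc 2, because doc 1 has both coast and landscape in its tags array. I also want to multiply the score by "boost_multiplier" so that some documents can be boosted against others.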
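To make the intended ranking concrete, here is a minimal Python sketch of the scoring I have in mind. It is not Elasticsearch code: `desired_score` and `rank` are hypothetical helpers, and matching is simplified to exact, whitespace-separated tag terms (so a multi-word tag such as "natural elevation" would never match here).

```python
# Hypothetical illustration of the desired ranking; not how Elasticsearch scores.
DOC_1 = {
    "id": "1",
    "boost_multiplier": 1,
    "tags": [
        {"tag": "beach", "confidence": 65.48948436785749},
        {"tag": "sea", "confidence": 57.31950504425406},
        {"tag": "coast", "confidence": 43.58207236617374},
        {"tag": "sand", "confidence": 35.6857910950816},
        {"tag": "landscape", "confidence": 33.660057321079655},
        {"tag": "sky", "confidence": 32.53252312423727},
    ],
}
DOC_2 = {
    "id": "2",
    "boost_multiplier": 1,
    "tags": [
        {"tag": "mountain", "confidence": 84.09123410403951},
        {"tag": "valley", "confidence": 56.412795342449456},
        {"tag": "landscape", "confidence": 48.36547551196872},
        {"tag": "mountains", "confidence": 40.51100450186575},
        {"tag": "sky", "confidence": 33.14263528292239},
        {"tag": "peak", "confidence": 31.064394646169404},
        {"tag": "natural elevation", "confidence": 29.372},
    ],
}


def desired_score(doc, search_text):
    """Sum the confidence of every matching tag, then apply the document boost."""
    terms = set(search_text.lower().split())
    matched = [t["confidence"] for t in doc["tags"] if t["tag"] in terms]
    if not matched:
        return None  # the document should not be returned at all
    return sum(matched) * doc["boost_multiplier"]


def rank(docs, search_text):
    """Return (id, score) pairs for matching documents, best score first."""
    scored = [(d["id"], desired_score(d, search_text)) for d in docs]
    hits = [(doc_id, score) for doc_id, score in scored if score is not None]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


for text in ("mountain", "landscape", "coast landscape"):
    print(text, "->", [doc_id for doc_id, _ in rank([DOC_1, DOC_2], text)])
```

With the two documents above, this ordering puts only doc 2 in the results for "mountain", doc 2 ahead of doc 1 for "landscape", and doc 1 ahead of doc 2 for "coast landscape".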

I found this question on SO: Elasticsearch: Influence scoring with custom score field in document

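However, when I try the accepted solution (I have scripting enabled on my ES server), it returns both documents with _score 1.0 no matter what the search term is. Here is the query I tried: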

{
  "query": {
    "nested": {
      "path": "tags",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "tags.tag": "coast landscape"
            }
          },
          "script_score": {
            "script": "doc[\"confidence\"].value"
          }
        }
      }
    }
  }
}

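I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor": {"field": "confidence"}, and got the same result. Any idea why it fails, or is there a better way to do this?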

For the full picture, here is the mapping definition I use:

{
  "mappings": {
    "photo": {
      "properties": {
        "created_at": {
          "type": "date"
        },
        "description": {
          "type": "text"
        },
        "height": {
          "type": "short"
        },
        "id": {
          "type": "keyword"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "tag": { "type": "string" },
            "confidence": { "type": "float"}
          }
        },
        "width": {
          "type": "short"
        },
        "color": {
          "type": "string"
        },
        "boost_multiplier": {
          "type": "float"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}

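UPDATE After @Joanna's answer below, I tried the query, but no matter what I put in the match query (coast, foo, bar), it always returns both documents with _score 1.0. I tried it on Elasticsearch 2.4.6, 5.3, and 5.5.1 in Docker. Here is the response I get: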

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635

{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}
}]}}

UPDATE-2 I found this on SO: Elasticsearch: "function_score" with "boost_mode":"replace" ignores function score

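It basically says that if the function doesn't match, it returns 1. That makes sense, but I am running the query against the same documents, so this is confusing.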

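FINAL UPDATE I finally found the problem, silly me. ES101: if you send a GET request to the search API, it returns all documents with a score of 1.0 :) You should send a POST request... Many thanks @Joanna, it works perfectly!!!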
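One plausible reading of the symptom is that the query body never reached Elasticsearch with the GET request (some HTTP clients drop the body on GET), and a _search call without a body matches every document with _score 1.0. A minimal sketch of sending the query as a POST with a JSON body, using only the Python standard library; the host `http://localhost:9200` and the helper name `build_search_request` are assumptions, while `my_index` is taken from the response shown above.

```python
import json
import urllib.request


def build_search_request(base_url, index, query_body):
    """Build (but do not send) a POST _search request carrying a JSON body."""
    data = json.dumps(query_body).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A trimmed version of the nested query; only the match clause is kept here.
query = {
    "query": {
        "nested": {
            "path": "tags",
            "score_mode": "sum",
            "query": {"match": {"tags.tag": "coast landscape"}},
        }
    }
}

# Assumed local cluster address; adjust to your setup.
request = build_search_request("http://localhost:9200", "my_index", query)
print(request.get_method(), request.full_url)

# To run it against a live cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["hits"]["max_score"])
```

Building the request separately from sending it makes it easy to confirm the method and body before anything hits the server.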

1 Answer

You can try this query. It combines both fields, confidence and boost_multiplier, into the score:
```
{
  "query": {
    "function_score": {
        "query": {
            "bool": {
                "should": [{
                    "nested": {
                      "path": "tags",
                      "score_mode": "sum",
                      "query": {
                        "function_score": {
                          "query": {
                            "match": {
                              "tags.tag": "landscape"
                            }
                          },
                          "field_value_factor": {
                            "field": "tags.confidence",
                            "factor": 1,
                            "missing": 0
                          }
                        }
                      }
                    }
                }]
            }
        },
        "field_value_factor": {
            "field": "boost_multiplier",
            "factor": 1,
            "missing": 0
        }
      }
    }
}
```
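When I search for the term coast, it returns:

- the document with id=1 only, since it is the only one with this term, with "_score": 100.27469

When I search for the term landscape, it returns two documents:

- the document with id=2, with "_score": 85.83046
- the document with id=1, with "_score": 59.7339

Since the document with id=2 has a higher confidence value for this tag, it gets the higher score.

When I search for coast landscape, it returns two documents:

- the document with id=1, with "_score": 160.00859
- the document with id=2, with "_score": 85.83046

Although the document with id=2 has a higher confidence value, the document with id=1 has two matching words, so it scores higher. By changing the value of the "factor": 1 parameter, you can decide how much confidence should influence the result.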
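The numbers above can be checked by hand. A small Python sketch, using only the scores and confidence values quoted in this thread: with "score_mode": "sum", the score for coast landscape on id=1 is the sum of its coast and landscape scores, and dividing each landscape score by the tag's confidence gives roughly the same factor for both documents, which suggests the text relevance is identical and only confidence separates them. The helper names are hypothetical.

```python
import math

# Scores reported above for the single-term searches, keyed by term then doc id.
reported = {
    "coast": {"1": 100.27469},
    "landscape": {"1": 59.7339, "2": 85.83046},
}

# Confidence values of the matching tags, copied from the documents.
confidence = {
    ("1", "coast"): 43.58207236617374,
    ("1", "landscape"): 33.660057321079655,
    ("2", "landscape"): 48.36547551196872,
}


def nested_sum(doc_id, terms):
    """Mimic score_mode "sum": add up the scores of the matching nested tags."""
    return sum(reported[t][doc_id] for t in terms if doc_id in reported[t])


def implied_relevance(doc_id, term):
    """Text relevance implied by the reported score (score / confidence)."""
    return reported[term][doc_id] / confidence[(doc_id, term)]


print(nested_sum("1", ["coast", "landscape"]))  # compare with 160.00859
print(nested_sum("2", ["coast", "landscape"]))  # compare with 85.83046
print(implied_relevance("1", "landscape"), implied_relevance("2", "landscape"))
```

The sums line up with the reported coast landscape scores to within float rounding.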

The boost_multiplier field

Something more interesting happens when I index a new document. Let's say it is almost the same as the document with id=2, but with "boost_multiplier" : 4 and "id": 3:
```
{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "3",
  "tags" : [
    ...
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    ...
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 4
}
```
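Running the same query with coast landscape now returns three documents:

- the document with id=3, with "_score": 360.02664
- the document with id=1, with "_score": 182.09859
- the document with id=2, with "_score": 90.00666

Although the document with id=3 has only one matching word (landscape), its boost_multiplier value increases the score considerably. Here too, "factor": 1 lets you decide how much this value should increase the score, and "missing": 0 decides what happens when no such field is indexed.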
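The arithmetic of the outer function_score can be sketched as follows. This assumes the defaults that apply when the query does not set them: no "modifier" on field_value_factor and "boost_mode": "multiply" on function_score. It also assumes the inner nested score of id=3 equals that of id=2 (90.00666), since the two documents are described as nearly identical. The Python function names are hypothetical.

```python
import math


def field_value_factor(value, factor=1.0, missing=None):
    """factor * field value, falling back to `missing` when the field is absent."""
    if value is None:
        if missing is None:
            # Assumption: without "missing", an absent field is treated as an error.
            raise ValueError("field has no value and no 'missing' default was given")
        value = missing
    return factor * value


def function_score(query_score, function_value):
    """Default boost_mode "multiply": query score times function value."""
    return query_score * function_value


inner_score = 90.00666  # nested score reported for id=2, assumed equal for id=3

boosted = function_score(inner_score, field_value_factor(4, factor=1))
unboosted = function_score(inner_score, field_value_factor(1, factor=1))
no_field = function_score(inner_score, field_value_factor(None, factor=1, missing=0))

print(boosted)    # compare with the 360.02664 reported for id=3
print(unboosted)  # the 90.00666 reported for id=2
print(no_field)   # "missing": 0 multiplies the score down to 0
```

Note that with multiplication, "missing": 0 zeroes the score of any document that lacks boost_multiplier; a value of 1 would leave such documents unboosted instead.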
Answered on 2024-05-02T16:01:37+08:00
