ArangoDB faceted search performance

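We are evaluating ArangoDB's performance for facet calculations. A number of other products can do the same thing, either through a special API or through their query language: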

MarkLogic Facets
ElasticSearch Aggregations
Solr Faceting, etc.

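We understand that ArangoDB has no special API for computing facets explicitly. In practice one is not needed: thanks to the comprehensive AQL, it can be achieved easily with a simple query such as: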

FOR a in Asset
  COLLECT attr = a.attribute1 INTO g
  RETURN { value: attr, count: length(g) }

This query computes a facet on attribute1 and yields the frequencies in the following form:

[
  {
    "value": "test-attr1-1",
    "count": 2000000
  },
  {
    "value": "test-attr1-2",
    "count": 2000000
  },
  {
    "value": "test-attr1-3",
    "count": 3000000
  }
]

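This tells me that, across the entire collection, attribute1 takes three values (test-attr1-1, test-attr1-2 and test-attr1-3), together with their counts. In effect, we run a DISTINCT query and aggregate the counts.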
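To make the semantics concrete, here is a minimal in-memory sketch in Java of what the facet query computes: group the attribute values and count each group. The class and method names (FacetCountSketch, countFacet) are made up for illustration, and this says nothing about how ArangoDB executes COLLECT internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only: an in-memory analogue of
// "COLLECT attr = a.attribute1" followed by a count per group.
class FacetCountSketch {

    // Groups equal attribute values and counts how often each one occurs.
    // A TreeMap keeps the facet values sorted, similar to the sorted output of COLLECT.
    static Map<String, Long> countFacet(List<String> attributeValues) {
        return attributeValues.stream()
                .collect(Collectors.groupingBy(v -> v, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "test-attr1-1", "test-attr1-2", "test-attr1-1", "test-attr1-3");
        // Prints each distinct value with its frequency.
        countFacet(values).forEach((value, count) ->
                System.out.println(value + ": " + count));
    }
}
```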

It looks simple and clean. There is just one, but really big, problem: performance.

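The query above takes 31 seconds to run, on a test collection of only 8M documents. We have tried different index types and storage engines (with RocksDB and without) and examined the explain plans, all to no avail. The test documents we use here are very small, with only three short attributes.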

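We would appreciate any input at this point. Either we are doing something wrong, or ArangoDB is simply not designed to perform well in this particular area.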

By the way, the ultimate goal is to run something like the following in under a second:

LET docs = (FOR a IN Asset
  FILTER a.name like 'test-asset-%'
  SORT a.name
  RETURN a)

LET attribute1 = (
  FOR a in docs
    COLLECT attr = a.attribute1 INTO g
    RETURN { value: attr, count: length(g[*])}
)

LET attribute2 = (
  FOR a in docs
    COLLECT attr = a.attribute2 INTO g
    RETURN { value: attr, count: length(g[*])}
)

LET attribute3 = (
  FOR a in docs
    COLLECT attr = a.attribute3 INTO g
    RETURN { value: attr, count: length(g[*])}
)

LET attribute4 = (
  FOR a in docs
    COLLECT attr = a.attribute4 INTO g
    RETURN { value: attr, count: length(g[*])}
)

RETURN {
  counts: (RETURN {
    total: LENGTH(docs),
    offset: 2,
    to: 4,
    facets: {
      attribute1: {
        from: 0,
        to: 5,
        total: LENGTH(attribute1)
      },
      attribute2: {
        from: 5,
        to: 10,
        total: LENGTH(attribute2)
      },
      attribute3: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute3)
      },
      attribute4: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute4)
      }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
  facets: {
    attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
    attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
    attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
    attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 return a)
  }
}

Thanks!

1 Answer

5
It turned out that the main discussion took place on the ArangoDB Google Group. Here is a link to the full discussion.

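Here is a summary of the current solution:
- Run a custom build of Arango from a specific feature branch in which a number of performance improvements have already been made (hopefully they will make it into a main release soon)
- No indexes are required for facet calculations
- MMFiles is the preferred storage engine
- The AQL should be written to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)"
- The AQL should be split into smaller pieces and run in parallel (we use Java 8's Fork/Join to spread out the facet AQLs and then join them into the final result)
- One AQL to filter/sort and retrieve the main entities (if required; when sorting/filtering, add the corresponding skiplist index)
- The rest are small AQLs, one per facet value/frequency pair
In the end, we gained a >10x performance improvement compared with the original AQL provided above.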
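The split-and-parallelize approach from the summary can be sketched in Java roughly as follows. This is not the code from the discussion, only a minimal sketch under stated assumptions: the class name ParallelFacetsSketch, the helpers facetQuery and runFacets, and the runQuery function (a stand-in for whatever actually sends AQL to the database) are all hypothetical. CompletableFuture.supplyAsync is used here because, by default, it schedules work on the common Fork/Join pool. The facet queries below run over the whole Asset collection; any filtering is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: one small facet AQL per attribute, run in parallel, then joined.
class ParallelFacetsSketch {

    // Builds a per-attribute facet query in the "WITH COUNT INTO" form.
    // Assumption: attribute names come from a fixed, trusted list, since they
    // are concatenated into the query text.
    static String facetQuery(String attribute) {
        return "FOR a IN Asset "
                + "COLLECT attr = a." + attribute + " WITH COUNT INTO length "
                + "RETURN { value: attr, count: length }";
    }

    // Submits one query per attribute and waits for all of them.
    // runQuery is a placeholder for the real database call.
    static <R> Map<String, R> runFacets(List<String> attributes, Function<String, R> runQuery) {
        Map<String, CompletableFuture<R>> pending = new LinkedHashMap<>();
        for (String attribute : attributes) {
            String aql = facetQuery(attribute);
            // supplyAsync without an executor uses ForkJoinPool.commonPool().
            pending.put(attribute, CompletableFuture.supplyAsync(() -> runQuery.apply(aql)));
        }
        // Join the partial results into one map, keeping the attribute order.
        Map<String, R> results = new LinkedHashMap<>();
        pending.forEach((attribute, future) -> results.put(attribute, future.join()));
        return results;
    }

    public static void main(String[] args) {
        // Stand-in runner that just echoes the query instead of contacting a database.
        Map<String, String> out = runFacets(
                Arrays.asList("attribute1", "attribute2", "attribute3"),
                aql -> "would run: " + aql);
        out.forEach((attribute, result) -> System.out.println(attribute + " -> " + result));
    }
}
```

In a real setup, runQuery would wrap the database driver call, and the query that filters, sorts and retrieves the main entities would be submitted alongside the facet queries in the same way.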
Answered on 2024-04-27T22:49:57+08:00
