Postgres：使用IN vs JOIN vs选择大量行-Java 学习之路

我需要从~150M记录表中检索由唯一数字ID标识的1,000 - 50,000条记录 . 我们在AWS RDS中托管数据库 . 该表有几个 integer 列，一个 character varying(500) 和一个 bigint 用于id列 . 每列都有 btree 索引 .

当前的生产环境查询是

SELECT *
FROM mytable t
WHERE id IN (N1, N2, .. Nm)

如果m <1,000，则返回1秒以内，这是可以接受的 . 问题是时间随m线性增加 . m = 30,000时查询需要20秒 .

我们尝试创建索引临时表并使用INNER JOIN，但没有明显的性能提升 . （https://stackoverflow.com/a/24647700/226960）

这是m> 70k的缩写转储 .

CREATE TEMPORARY TABLE temp_phrases (phrase_id integer) ON COMMIT DROP
CREATE TABLE temp_phrases: OK, time: 0.01 seconds.
CREATE INDEX temp_phrases_phrase_id_idx ON temp_phrases(phrase_id)
CREATE INDEX '.temp_phrases.'_phrase_id_idx: OK, time: 0 seconds.
INSERT INTO TABLE temp_phrases: 70544, time: 0.3 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join  (cost=0.57..665368.61 rows=79815 width=34)
->  Seq Scan on temp_phrases (cost=0.00..1111.15 rows=79815 width=4)
->  Index Scan using thesaurus_pkey on thesaurus  (cost=0.57..8.31 rows=1 width=42)
    Index Cond: (id = temp_phrases.phrase_id)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 70544 results, 52.2seconds

如果我们重复查询，则需要不到一秒的时间来获得结果，这将指示与硬件相关的瓶颈
https://stackoverflow.com/a/24254825/226960

是否可以改进原始 id IN(_list_) 查询？获得额外的RDS IOPS会有帮助吗？

编辑
EXPLAIN (ANALYZE, BUFFERS) 输出

INSERT INTO TABLE temp_phrases: 41504, time: 0.17 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join  (cost=0.57..396319.90 rows=46920 width=34) (actual time=0.708..23874.200 rows=41504 loops=1)
  Buffers: shared hit=167593 read=39458 dirtied=138, local hit=184
  ->  Seq Scan on temp_phrases  (cost=0.00..653.20 rows=46920 width=4) (actual time=0.012..21.138 rows=41504 loops=1)
        Buffers: local hit=184
  ->  Index Scan using thesaurus_pkey on thesaurus  (cost=0.57..8.42 rows=1 width=42) (actual time=0.569..0.572 rows=1 loops=41504)
        Index Cond: (id = temp_phrases.phrase_id)
        Buffers: shared hit=167593 read=39458 dirtied=138
Planning time: 1.493 ms
Execution time: 23887.493 ms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 41504 results, 24.2seconds

1 回答

0

使用btree索引，或根本没有索引，只会让它尝试匹配每一行的组合 .

如果你在比赛结束后让它停止，你将获得两倍的速度 .

临时表上的唯一键或主键 . 因为当你加入它时，它会知道一旦找到一个匹配，就不会找到其他匹配 .

根据我的经验，当从 A 中选择并加入 B 时，加入一个唯一键会有很大帮助 . 性能提升通常会减少50％的时间（与简单的索引相比） .

如果在其他情况下无法使用唯一键，则哈希索引将执行相同操作，但索引需要一段时间 .

但正如您在计划中所看到的，它执行序列扫描，而不是临时表的索引扫描 . 使用任何索引可能会有所帮助，因为那里有很多元组 .

回复于 2024-05-04T08:44:49+08:00

Postgres：使用IN vs JOIN vs选择大量行

1 回答

相关问题