Deleting duplicate rows from a large PostgreSQL table

Asked 2024-04-30T14:45:08+08:00

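I have a PostgreSQL database of about 100 GB. One of its tables has roughly 500 million entries. To load the data quickly, some rows were inserted more than once and left to be pruned later. One of the columns can be used to identify a row as unique.

I found this Stack Overflow question, which suggests a solution for MySQL: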

ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)

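Is there anything similar for PostgreSQL?

I tried deleting with GROUP BY and with row_number(); in both cases my computer ran out of memory after a few hours.

This is what I get when I try to estimate the number of rows in the table: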

SELECT reltuples FROM pg_class WHERE relname = 'orders';
  reltuples  
-------------
 4.38543e+08
(1 row)

1 Answer

1
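Two solutions come to mind immediately:

1) Create a new table as SELECT * from the source table, with a WHERE clause that determines the unique rows. Add indexes to match the source table, then rename the two tables in a transaction. Whether this works for you depends on several factors, including the amount of free disk space, whether the table is in constant use, and whether an interruption of access is acceptable. Creating a new table has the benefit of tightly packed data and indexes, and the new table will be smaller than the original because the non-unique rows are omitted.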
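
A minimal sketch of option 1, under a few assumptions: the table is the orders table from the question, the identifying column is called order_key (a hypothetical name, since the question does not give it), and nothing writes to orders while the copy runs. It uses DISTINCT ON, which is PostgreSQL-specific, instead of a WHERE clause to keep one row per key.

```sql
-- Sketch only: "order_key" is a hypothetical name for the identifying column.

-- Copy one row per key into a new table. DISTINCT ON keeps the first row of
-- each group in ORDER BY order; add more ORDER BY terms to control which
-- duplicate survives.
CREATE TABLE orders_dedup AS
SELECT DISTINCT ON (order_key) *
FROM orders
ORDER BY order_key;

-- Enforce uniqueness from now on. Recreate the source table's other indexes
-- here as well.
CREATE UNIQUE INDEX orders_dedup_order_key_uidx ON orders_dedup (order_key);

-- Swap the tables atomically.
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_dedup RENAME TO orders;
COMMIT;

-- After checking the result:
-- DROP TABLE orders_old;
```

CREATE TABLE ... AS copies only the data, so defaults, constraints, triggers and privileges on the original table would have to be recreated separately. Rows written to orders after the copy starts are not carried over, which is why this fits best when a write pause is acceptable.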

2) Create a partial unique index on the columns, with a WHERE clause that filters out the non-unique rows. For example:
```
test=# create table t ( col1 int, col2 int, is_unique boolean);
CREATE TABLE

test=# insert into t values (1,2,true), (2,3,true),(2,3,false);
INSERT 0 3

test=# create unique index concurrently t_col1_col2_uidx on t (col1, col2) where is_unique is true;
CREATE INDEX

test=# \d t
        Table "public.t"
  Column   |  Type   | Modifiers 
-----------+---------+-----------
 col1      | integer | 
 col2      | integer | 
 is_unique | boolean | 
Indexes:
    "t_col1_col2_uidx" UNIQUE, btree (col1, col2) WHERE is_unique IS TRUE
```
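To illustrate what the partial index enforces, here is a short continuation of the session above, using the same table t and index t_col1_col2_uidx. Only rows that satisfy the index predicate are checked for uniqueness.

```sql
-- A row flagged false falls outside the index predicate, so a repeated
-- (col1, col2) pair is accepted.
INSERT INTO t VALUES (2, 3, false);

-- A row flagged true is covered by the index; (2, 3) already exists with
-- is_unique = true, so this statement fails with a unique-violation error.
INSERT INTO t VALUES (2, 3, true);
```

The index therefore guarantees at most one flagged row per (col1, col2), while the unflagged duplicates stay in the table until they are deleted.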
Answered 2024-04-30T14:45:08+08:00
