由于索引，SQLite插入速度会随着记录数量的增加而减慢-Java 学习之路

原始问题

背景

众所周知，SQLite needs to be fine tuned实现插入速度大约为50k insert / s . 这里有很多关于缓慢插入速度和大量建议和基准的问题 .

还有claims that SQLite can handle large amounts of data，50 GB的报告没有导致正确设置出现任何问题 .

我已经按照这里和其他地方的建议来实现这些速度，我很高兴35k-45k插入/秒 . 我遇到的问题是，所有基准测试只能显示<1m记录的快速插入速度 . 我所看到的是插入速度似乎与表格大小成反比 .

问题

我的用例要求在链接表中存储500m到1b元组（ [x_id, y_id, z_id] ）几年（1m行/天） . 值均为介于1和2,000,000之间的整数ID . z_id 上有一个索引 .

前10米行的性能很好，约35k插入/秒，但是当表有~20m行时，性能开始下降 . 我现在看到大约100个插入/秒 .

table 的大小不是特别大 . 对于20米行，磁盘上的大小约为500MB .

该项目是用Perl编写的 .

问题

这是SQLite中大型表的现实，还是有任何保密措施来维护> 10m行的表的高插入率？

如果可能的话，我想避免使用已知的解决方法

删除索引，添加记录并重新编制索引：这可以作为一种解决方法，但不能使数据库在x分钟/天内完全无法访问
将表分成较小的子表/文件：这将在短期内起作用，我已经对它进行了实验 . 问题是我需要能够在查询时从整个历史中检索数据，这意味着如果没有其他替代方案，最终我会尝试它 .
设置SQLITE_FCNTL_CHUNK_SIZE：我不想仅仅为了完成这项工作而学习它 . 我看不到使用Perl设置此参数的任何方法 .

更新

在Tim's suggestion之后，尽管SQLite声称它能够处理大型数据集，但索引导致插入时间越来越慢，我使用以下设置进行了基准比较：

插入行：1400万
提交批量大小：50,000条记录
cache_size pragma：10,000
page_size pragma：4,096
temp_store pragma：记忆
journal_mode pragma：删除
synchronous pragma：关闭

在我的项目中，如下面的基准测试结果，创建了基于文件的临时表，并使用了SQLite对导入CSV数据的内置支持 . 然后将临时表附加到接收数据库，并使用 insert-select 语句插入50,000行的集合 . 因此，插入时间不会将文件反映到数据库插入时间，而是反映表到表的插入速度 . 考虑CSV导入时间会使速度降低25-50％（非常粗略估计，导入CSV数据不需要很长时间） .

显然，随着表格大小的增加，索引会导致插入速度减慢 .

Plot of SQLite insert speed and table size

从上面的数据可以清楚地看出，正确的答案可以分配给Tim's answer而不是SQLite无法处理的断言 . 显然，如果索引该数据集不是您的用例的一部分，它可以处理大型数据集 . 我一直在使用SQLite作为日志记录系统的后端，暂时不需要编入索引，所以我对我经历的减速感到非常惊讶 .

结论

如果有人发现自己想要使用SQLite存储大量数据并将其编入索引，那么using shards可能就是答案 . 我最终决定使用MD5哈希的前三个字符 z 中的一个唯一列来确定对4,096个数据库之一的分配 . 由于我的用例主要是归档，因此架构不会更改，查询也不会需要碎片遍历 . 数据库大小有限，因为极其旧的数据将被减少并最终被丢弃，因此这种分片，编译指示设置甚至一些非规范化的组合给了我一个很好的 balancer ，基于上面的基准测试，将保持插入速度至少10k插入/秒 .

5 回答

0

如果你的要求是找到一个特定的z_id以及与之链接的x_ids和y_ids（与快速选择一系列z_ids不同），你可以查看一个非索引的哈希表嵌套关系数据库，它可以让你立即找到通往特定z_id的方式，以获得其y_ids和x_ids - 在索引增长的过程中，在插入过程中没有索引开销和随之而来的降低性能 . 为了避免聚集桶冲突，请选择一个密钥散列算法，该算法最大限度地考虑z_id的数字，具有最大的变化（右加权） .

附：使用b树的数据库可以在例如，第一个看起来比使用线性散列的数据库更快，但随着b-tree上的性能开始降低，插入性能将保持与线性散列的水平 .

P.P.S.回答kawing-chiu的问题：这里的核心特征是这样的数据库依赖于所谓的"sparse"表，其中记录的物理位置由散列算法确定，该算法将记录键作为输入 . 这种方法允许直接搜索表中记录的位置，而不需要索引的中介 . 由于不需要遍历索引或重新 balancer 索引，因此随着表变得更加密集，插入时间保持不变 . 相比之下，使用b树，随着索引树的增长，插入时间会降低 . 具有大量并发插入的OLTP应用程序可以从这种稀疏表方法中受益 . 记录分散在整个表格中 . 分散在稀疏表的"tundra"上的记录的缺点是收集具有共同值的大量记录（例如邮政编码）可能会更慢 . 散列稀疏表方法经过优化，可以插入和检索单个记录，并检索相关记录的网络，而不是具有一些共同字段值的大型记录集 .

嵌套关系数据库是允许行的列中的元组的数据库 .

回复于 2024-04-27T19:38:53+08:00

很棒的问题和非常有趣的后续行动！

我想简单地说一下：你提到将表分成较小的子表/文件并在以后附加它们不是一个选项，因为你认为你已经考虑过中途选项：将数据分成几个表但是继续使用相同的单个数据库（文件） .

我做了一个非常粗略的基准测试，以确保我的建议真的对性能产生影响 .

架构：

CREATE TABLE IF NOT EXISTS "test_$i"
(
    "i" integer NOT NULL,
    "md5" text(32) NOT NULL
);

数据 - 200万行：

i = 1..2,000,000
md5 = i 的md5十六进制摘要

每笔交易= 50,000 INSERT s .

数据库：1;表：1;指数：0

0..50000 records inserted in 1.87 seconds
50000..100000 records inserted in 1.92 seconds
100000..150000 records inserted in 1.97 seconds
150000..200000 records inserted in 1.99 seconds
200000..250000 records inserted in 2.19 seconds
250000..300000 records inserted in 1.94 seconds
300000..350000 records inserted in 1.94 seconds
350000..400000 records inserted in 1.94 seconds
400000..450000 records inserted in 1.94 seconds
450000..500000 records inserted in 2.50 seconds
500000..550000 records inserted in 1.94 seconds
550000..600000 records inserted in 1.94 seconds
600000..650000 records inserted in 1.93 seconds
650000..700000 records inserted in 1.94 seconds
700000..750000 records inserted in 1.94 seconds
750000..800000 records inserted in 1.94 seconds
800000..850000 records inserted in 1.93 seconds
850000..900000 records inserted in 1.95 seconds
900000..950000 records inserted in 1.94 seconds
950000..1000000 records inserted in 1.94 seconds
1000000..1050000 records inserted in 1.95 seconds
1050000..1100000 records inserted in 1.95 seconds
1100000..1150000 records inserted in 1.95 seconds
1150000..1200000 records inserted in 1.95 seconds
1200000..1250000 records inserted in 1.96 seconds
1250000..1300000 records inserted in 1.98 seconds
1300000..1350000 records inserted in 1.95 seconds
1350000..1400000 records inserted in 1.95 seconds
1400000..1450000 records inserted in 1.95 seconds
1450000..1500000 records inserted in 1.95 seconds
1500000..1550000 records inserted in 1.95 seconds
1550000..1600000 records inserted in 1.95 seconds
1600000..1650000 records inserted in 1.95 seconds
1650000..1700000 records inserted in 1.96 seconds
1700000..1750000 records inserted in 1.95 seconds
1750000..1800000 records inserted in 1.95 seconds
1800000..1850000 records inserted in 1.94 seconds
1850000..1900000 records inserted in 1.95 seconds
1900000..1950000 records inserted in 1.95 seconds
1950000..2000000 records inserted in 1.95 seconds

数据库文件大小：89.2 MiB .

数据库：1;表：1;指数：1（md5）

0..50000 records inserted in 2.90 seconds
50000..100000 records inserted in 11.64 seconds
100000..150000 records inserted in 10.85 seconds
150000..200000 records inserted in 10.62 seconds
200000..250000 records inserted in 11.28 seconds
250000..300000 records inserted in 12.09 seconds
300000..350000 records inserted in 10.60 seconds
350000..400000 records inserted in 12.25 seconds
400000..450000 records inserted in 13.83 seconds
450000..500000 records inserted in 14.48 seconds
500000..550000 records inserted in 11.08 seconds
550000..600000 records inserted in 10.72 seconds
600000..650000 records inserted in 14.99 seconds
650000..700000 records inserted in 10.85 seconds
700000..750000 records inserted in 11.25 seconds
750000..800000 records inserted in 17.68 seconds
800000..850000 records inserted in 14.44 seconds
850000..900000 records inserted in 19.46 seconds
900000..950000 records inserted in 16.41 seconds
950000..1000000 records inserted in 22.41 seconds
1000000..1050000 records inserted in 24.68 seconds
1050000..1100000 records inserted in 28.12 seconds
1100000..1150000 records inserted in 26.85 seconds
1150000..1200000 records inserted in 28.57 seconds
1200000..1250000 records inserted in 29.17 seconds
1250000..1300000 records inserted in 36.99 seconds
1300000..1350000 records inserted in 30.66 seconds
1350000..1400000 records inserted in 32.06 seconds
1400000..1450000 records inserted in 33.14 seconds
1450000..1500000 records inserted in 47.74 seconds
1500000..1550000 records inserted in 34.51 seconds
1550000..1600000 records inserted in 39.16 seconds
1600000..1650000 records inserted in 37.69 seconds
1650000..1700000 records inserted in 37.82 seconds
1700000..1750000 records inserted in 41.43 seconds
1750000..1800000 records inserted in 49.58 seconds
1800000..1850000 records inserted in 44.08 seconds
1850000..1900000 records inserted in 57.17 seconds
1900000..1950000 records inserted in 50.04 seconds
1950000..2000000 records inserted in 42.15 seconds

数据库文件大小：181.1 MiB .

数据库：1;表：20（每100,000个记录一个）;指数：1（md5）

0..50000 records inserted in 2.91 seconds
50000..100000 records inserted in 10.30 seconds
100000..150000 records inserted in 10.85 seconds
150000..200000 records inserted in 10.45 seconds
200000..250000 records inserted in 10.11 seconds
250000..300000 records inserted in 11.04 seconds
300000..350000 records inserted in 10.25 seconds
350000..400000 records inserted in 10.36 seconds
400000..450000 records inserted in 11.48 seconds
450000..500000 records inserted in 10.97 seconds
500000..550000 records inserted in 10.86 seconds
550000..600000 records inserted in 10.35 seconds
600000..650000 records inserted in 10.77 seconds
650000..700000 records inserted in 10.62 seconds
700000..750000 records inserted in 10.57 seconds
750000..800000 records inserted in 11.13 seconds
800000..850000 records inserted in 10.44 seconds
850000..900000 records inserted in 10.40 seconds
900000..950000 records inserted in 10.70 seconds
950000..1000000 records inserted in 10.53 seconds
1000000..1050000 records inserted in 10.98 seconds
1050000..1100000 records inserted in 11.56 seconds
1100000..1150000 records inserted in 10.66 seconds
1150000..1200000 records inserted in 10.38 seconds
1200000..1250000 records inserted in 10.24 seconds
1250000..1300000 records inserted in 10.80 seconds
1300000..1350000 records inserted in 10.85 seconds
1350000..1400000 records inserted in 10.46 seconds
1400000..1450000 records inserted in 10.25 seconds
1450000..1500000 records inserted in 10.98 seconds
1500000..1550000 records inserted in 10.15 seconds
1550000..1600000 records inserted in 11.81 seconds
1600000..1650000 records inserted in 10.80 seconds
1650000..1700000 records inserted in 11.06 seconds
1700000..1750000 records inserted in 10.24 seconds
1750000..1800000 records inserted in 10.57 seconds
1800000..1850000 records inserted in 11.54 seconds
1850000..1900000 records inserted in 10.80 seconds
1900000..1950000 records inserted in 11.07 seconds
1950000..2000000 records inserted in 13.27 seconds

数据库文件大小：180.1 MiB .

如您所见，如果将数据分成几个表，则插入速度将保持不变 .

回复于 2024-04-27T19:38:53+08:00

1

不幸的是，我在'd say this is a limitation of large tables in SQLite. It' s not designed上操作大型或大容量数据集 . 虽然我知道它可能会大大增加项目的复杂性，但您最好还是研究适合您需求的更复杂的数据库解决方案 .

从您链接的所有内容来看，表格大小与访问速度相似是直接的权衡 . 不能兼得 .

回复于 2024-04-27T19:38:53+08:00
12

在我的项目中，我无法对数据库进行分片，因为它在不同的列上编制索引 . 为了加速插入，我在创建期间将数据库放在/ dev / shm（= linux ramdisk）上，然后将其复制到本地磁盘 . 这显然只适用于一次写入，多次读取的数据库 .

回复于 2024-04-27T19:38:53+08:00
7

我怀疑索引的哈希值冲突导致插入速度变慢 .

当我们在一个表中有很多行，然后索引列哈希值冲突将更频繁地发生 . 这意味着Sqlite引擎需要计算哈希值两次或三次，或者甚至四次，以获得不同的哈希值 .

所以我猜这是表格有很多行时SQLite插入缓慢的根本原因 .

这一点可以解释为什么使用分片可以避免这个问题 . 谁是SQLite域名的真正专家，以确认或否认我的观点？

回复于 2024-04-27T19:38:53+08:00

由于索引，SQLite插入速度会随着记录数量的增加而减慢