Optimizing search in a very large csv file

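I have a csv file with a single column but 6.2 million rows, each containing a string of 6 to 20 letters. Some strings appear in duplicate (or more) entries, and I want to write those to a new csv file; my guess is that there should be around 1 million non-unique strings. That's it, really. However, repeatedly searching a dictionary of 6 million entries takes time, and I'd appreciate advice on how to do this. Based on some timings I've done, every script I've written so far would take at least a week (!) to run.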

First attempt:

in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
out_file_2 = open('UniProt Unique Trypsin Peptides.csv','w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)

# Create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
    ref_dict[row] = in_list_1[row]

# Find unique/non-unique peptides from trypsinome
Peptide_list = []
Uniques = []
for n in range(len(in_list_1)):
    Peptide = ref_dict.pop(n)
    if Peptide in ref_dict.values(): # Non-unique peptides
        Peptide_list.append(Peptide)
    else:
        Uniques.append(Peptide) # Unique peptides

for m in range(len(Peptide_list)):
    Write_list = (str(Peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
    writer_1.writerow(Write_list)

Second attempt:

in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
writer_1 = csv.writer(out_file_1)

ref_dict = {}
for row in range(len(in_list_1)):
    Peptide = in_list_1[row]
    if Peptide in ref_dict.values():
        write = (in_list_1[row],'')
        writer_1.writerow(write)
    else:
        ref_dict[row] = in_list_1[row]

Edit: here are a few lines of the csv file:

SELVQK
AKLAEQAER
AKLAEQAERR
LAEQAER
LAEQAERYDDMAAAMK
LAEQAERYDDMAAAMKK
MTMDKSELVQK
YDDMAAAMKAVTEQGHELSNEER
YDDMAAAMKAVTEQGHELSNEERR

4 Answers

0
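First hint: Python supports lazy evaluation, so make good use of it when working with huge datasets. So:
- iterate over your csv.reader instead of building a huge in-memory list,
- don't build huge in-memory lists with range: use enumerate(seq) if you need both the item and the index, and just iterate over the sequence's items if you don't need the index.
Second hint: the main point of using a dict (hash table) is to look up keys, not values, so don't build a huge dict that is used as a list.

Third hint: if you just want a way to store "already seen" values, use a set.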
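
The three hints above can be combined into a small sketch. This is only an illustration, not the answerer's code: the function name `find_repeated` and the in-memory sample are assumptions made for the example, and it assumes the peptide is in the first column of each row.

```python
import csv
import io

def find_repeated(rows):
    """Return the set of first-column strings that occur more than once."""
    seen = set()       # every string encountered so far
    repeated = set()   # strings encountered at least twice
    for row in rows:   # rows is consumed lazily, one row at a time
        if not row:    # skip blank lines
            continue
        peptide = row[0]
        if peptide in seen:   # hash lookup, no scan over all values
            repeated.add(peptide)
        else:
            seen.add(peptide)
    return repeated

# Small in-memory sample standing in for the real file (hypothetical data)
sample = io.StringIO("SELVQK\nAKLAEQAER\nSELVQK\nLAEQAER\nAKLAEQAER\nSELVQK\n")
print(sorted(find_repeated(csv.reader(sample))))
# prints ['AKLAEQAER', 'SELVQK']
```

For the real file, the same function could be fed `csv.reader(open(path, newline=''))`, so that only the two sets, and never the full list of rows, are held in memory.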
Answered 2024-04-19T16:13:56+08:00
2
Do it with Numpy. Roughly:
```
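import numpy as np

# The file has a single column, so loadtxt returns a 1-D array of strings.
# dtype=str is an assumption filling in the dtype left as a TODO.
peptides = np.loadtxt("thefile", dtype=str)
# Count how many times each distinct string occurs
values, counts = np.unique(peptides, return_counts=True)
# Print only the strings that occur more than once
for peptide in values[counts > 1]:
    print(peptide)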
```
You could even vectorize the output stage using numpy.savetxt and the char-array operators, but it probably won't make much of a difference.
Answered 2024-04-19T16:13:56+08:00
2

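I'm not very good at Python, so I don't know how 'in' works, but your algorithm appears to run in O(n²). Try sorting the list after reading it, using an O(n log n) algorithm such as quicksort; that should do much better. Once the list is sorted, you only need to check whether two consecutive elements are equal.

So you get the read in O(n), the sort in O(n log n) (at best), and the comparison in O(n).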
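
A minimal sketch of the sort-then-compare idea, assuming the strings have already been read into a list. The function name `repeated_by_sorting` is made up for the example, and Python's built-in `sorted` is used in place of a hand-written quicksort.

```python
def repeated_by_sorting(items):
    """Return each string that occurs more than once, in sorted order."""
    ordered = sorted(items)   # O(n log n) comparison sort
    repeated = []
    # After sorting, equal strings are adjacent, so one linear pass suffices
    for prev, cur in zip(ordered, ordered[1:]):
        # Record a string once, even if it occurs three or more times
        if prev == cur and (not repeated or repeated[-1] != cur):
            repeated.append(cur)
    return repeated

print(repeated_by_sorting(
    ["SELVQK", "LAEQAER", "SELVQK", "AKLAEQAER", "LAEQAER", "SELVQK"]))
# prints ['LAEQAER', 'SELVQK']
```

Unlike a set-based approach, this needs the whole list in memory, but 6.2 million short strings should still fit on most machines.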

Answered 2024-04-19T16:13:56+08:00
0
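Although I think the numpy solution is the best, I'm curious whether we can speed up the given example. My suggestions are:
- skip the csv.reader overhead and just read the lines
- open with rb to skip the extra scan needed for fixing newlines
- use larger file buffer sizes (reading with 1 MB and writing with 64 KB is probably good)
- use the dict keys as the index: key lookup is much faster than value lookup
I'm not a numpy person, so I would do something like this: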
```
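# Binary mode, with a 1 MB read buffer and a 64 KB write buffer
in_file_1 = open('UniProt Trypsinome (full).csv', 'rb', 1048576)
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv', 'wb', 65536)

ref_dict = {}
for line in in_file_1:
    peptide = line.rstrip()
    if peptide in ref_dict:
        # Already seen: write it out (lines are bytes in 'rb' mode)
        out_file_1.write(peptide + b'\n')
    else:
        # First occurrence: remember it as a dict key
        ref_dict[peptide] = None

# Close both files so the buffered output is flushed to disk
in_file_1.close()
out_file_1.close()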
```
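
The last suggestion in the list above (key lookup versus value lookup) can be checked with a rough timing sketch. The data here is a hypothetical stand-in for the peptide file, and the absolute numbers will vary by machine; only the gap between the two matters.

```python
import timeit

n = 200_000
# Hypothetical stand-in data: row index -> distinct string, as in the question
by_index = {i: "PEP%d" % i for i in range(n)}
# The same strings stored as dict keys instead of values
by_value = dict.fromkeys(by_index.values())

target = "PEP%d" % (n - 1)   # last entry: worst case for the linear scan

def value_lookup():
    # Scans the values one by one until a match is found: O(n)
    return target in by_index.values()

def key_lookup():
    # Hashes the target and probes the table: O(1) on average
    return target in by_value

slow = timeit.timeit(value_lookup, number=20)
fast = timeit.timeit(key_lookup, number=20)
print("values() scan: %.4fs, key lookup: %.6fs" % (slow, fast))
```

Because the original scripts run a values() scan once per row, their cost grows with the square of the row count, which is consistent with the week-long runtime estimate in the question.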
Answered 2024-04-19T16:13:56+08:00
