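Fast method in Python to split a large text file using the number of lines as an input variable

I am splitting a text file using the number of lines as a variable. I wrote this function to save the split files in a temporary directory. Each file has 4 million lines, except the last one.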

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

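The main problem is the speed of this function. Splitting an 8-million-line file into two files of 4 million lines each takes more than 30 minutes on my Windows machine with Python 2.7.

4 Answers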

for line in group:
    with open(output_name, 'a') as outfile:
        outfile.write(line)

This opens the output file and writes a single line for each line in the group. That is slow.

Instead, write once per group.

with open(output_name, 'a') as outfile:
    outfile.write(''.join(group))
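
A minimal Python 3 sketch of the same idea, assuming it is acceptable to stream each chunk with `itertools.islice` and `writelines` rather than joining the whole group into one string (which would hold up to `chunk` lines in memory at once). The function name `split_by_lines` and the `chunk_%d.tmp` file naming are illustrative and not taken from the question.

```python
import os
from itertools import islice


def split_by_lines(filename, out_dir, chunk=4000000):
    """Split filename into files of at most `chunk` lines; return their paths."""
    paths = []
    with open(filename, "r") as datafile:
        index = 0
        while True:
            # Read one line first so that no empty trailing file is created.
            first = next(datafile, None)
            if first is None:
                break
            path = os.path.join(out_dir, "chunk_%d.tmp" % index)
            # Open each output file once per chunk, not once per line.
            with open(path, "w") as outfile:
                outfile.write(first)
                outfile.writelines(islice(datafile, chunk - 1))
            paths.append(path)
            index += 1
    return paths
```

Called as `split_by_lines("big.txt", tempfile.mkdtemp())`, this should produce one output file per 4 million input lines, with a shorter last file if the line count is not an exact multiple.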

Answered on 2024-04-29T15:44:57+08:00

You can use tempfile.NamedTemporaryFile directly in a context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns={}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False,
                           dir=temp_dir,prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k]=outfile.name   
    return fns                     

def make_test(size=8*10**6+1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))

    return fn.name        

fn=make_test()
t0=time.time()
print tempfile_split(fn,tempfile.mkdtemp()),time.time()-t0

On my computer, the tempfile_split part runs in 3.6 seconds. That is on OS X.

Answered on 2024-04-29T15:44:57+08:00
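
The code in this answer targets Python 2 (`xrange`, the `print` statement). Under Python 3, `tempfile.NamedTemporaryFile` opens in binary mode (`'w+b'`) by default, so writing `str` lines to it raises a `TypeError`. Here is a hedged sketch of a Python 3 adaptation, assuming text mode is what is wanted. The name `tempfile_split_py3` is hypothetical.

```python
import tempfile
from itertools import count, groupby


def tempfile_split_py3(filename, temp_dir, chunk=4 * 10**6):
    """Python 3 adaptation: returns {group index: path of the chunk file}."""
    fns = {}
    counter = count()
    with open(filename, "r") as datafile:
        # Consecutive lines share a key until `chunk` lines have been seen.
        groups = groupby(datafile, key=lambda _line: next(counter) // chunk)
        for k, group in groups:
            # mode="w" opens the temporary file for text instead of bytes.
            with tempfile.NamedTemporaryFile(
                mode="w", delete=False, dir=temp_dir, prefix="{}_".format(k)
            ) as outfile:
                # writelines streams the group without building one big string.
                outfile.writelines(group)
                fns[k] = outfile.name
    return fns
```

Because `delete=False` is kept from the original, the chunk files remain on disk after the function returns and have to be cleaned up by the caller.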

If you are in a Linux or Unix environment, you can cheat and call the split command from within Python. For me, this trick is also very fast:

import os
import subprocess

def split_file(file_path, chunk=4000):

    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
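
A hedged Python 3 variant of the same trick, assuming a POSIX `split` is available on the `PATH`. It uses `subprocess.run(..., check=True)` so that a failing `split` raises an exception instead of being silently ignored, and it leaves the original file in place. The function name `split_with_cli` and the `prefix` parameter are illustrative.

```python
import glob
import subprocess


def split_with_cli(file_path, prefix, chunk=4000000):
    """Run `split -l chunk` and return the sorted list of files it created."""
    # -a 2 gives two-letter suffixes: prefix + "aa", prefix + "ab", ...
    subprocess.run(
        ["split", "-a", "2", "-l", str(chunk), file_path, prefix],
        check=True,
    )
    # Assumes no unrelated files already match prefix + two characters.
    return sorted(glob.glob(glob.escape(prefix) + "??"))
```

With two-letter suffixes, `split` can name at most 676 output files, so a very small `chunk` on a very large input would make the command fail rather than overwrite anything.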

Answered on 2024-04-29T15:44:57+08:00

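I just did a quick test with an 8-million-line file (lines of uptime output): read through the file to get its length, then split it in half. Basically, one pass to get the line count and a second pass to do the split write.

On my system, the first pass took about 2-3 seconds. Completing the whole run, including writing the split files, took under 21 seconds in total.

I did not implement the lambda function from the OP's post. The code I used is below: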
```
#!/usr/bin/env python

import sys
import math

infile = open("input","r")

linecount=0

for line in infile:
    linecount=linecount+1

splitpoint=linecount/2

infile.close()

infile = open("input","r")
outfile1 = open("output1","w")
outfile2 = open("output2","w")

print linecount , splitpoint

linecount=0

for line in infile:
    linecount=linecount+1
    if ( linecount <= splitpoint ):
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()
```
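No, it is not going to win any performance or code-elegance contests. :) But unless something else is the performance bottleneck, the lambda function is causing the file to be cached in memory and forcing a swap problem, or the lines in the file are extremely long, I don't see why it would take 30 minutes to read and split an 8-million-line file.

EDIT:

My environment: Mac OS X, with the storage being a single FW800-attached hard drive. The file was freshly created to avoid any benefit from filesystem caching.

Answered on 2024-04-29T15:44:57+08:00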
