
How to get the line count of a large file cheaply in Python?


I need to get the line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory- and time-wise?

At the moment I do this:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is it possible to do any better?

30 Answers

  • 7
    def file_len(full_path):
      """ Count number of lines in a file."""
      f = open(full_path)
      nr_of_lines = sum(1 for line in f)
      f.close()
      return nr_of_lines
    
  • 6

    Along the same lines:

    lines = 0
    with open(path) as f:
        for line in f:
            lines += 1
    
  • 4

    Another possibility:

    import subprocess
    
    def num_lines_in_file(fpath):
        return int(subprocess.check_output('wc -l %s' % fpath, shell=True).strip().split()[0])
    
  • 183

    I would use Python's file object method readlines, as follows:

    with open(input_file) as foo:
        lines = len(foo.readlines())
    

    This opens the file, creates a list of lines in the file, counts the length of the list, saves it to a variable, and closes the file again.

  • 3

    The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

    def file_len(filename):
        with open(filename) as f:
            return len(list(f))
    

    This is more concise than an explicit loop, and avoids enumerate.

  • 4

    A one-line solution:

    import os
    os.system("wc -l  filename")
    

    My snippet:

    os.system('wc -l *.txt')

    0 bar.txt
    1000 command.txt
    3 test_file.txt
    1003 total
    
  • 11

    Why not read the first 100 and the last 100 lines and estimate the average line length, then divide the total file size by that number? If you don't need an exact value, this could work.
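
    A minimal sketch of that estimation idea (the function name, the sample size, and sampling only the head of the file are illustrative choices, not from the answer):

    import os

    def estimate_line_count(path, sample_lines=100):
        """Estimate the line count from the average length of a
        sample of lines, without reading the whole file."""
        total_size = os.path.getsize(path)
        with open(path, 'rb') as f:
            sample = [f.readline() for _ in range(sample_lines)]
        sample = [line for line in sample if line]  # drop empty reads past EOF
        if not sample:
            return 0
        avg_len = sum(len(line) for line in sample) / float(len(sample))
        return int(total_size / avg_len)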

  • 3

    Here is a Python program that uses the multiprocessing library to distribute the line counting across machines/cores. My tests improved counting a 20-million-line file from 26 seconds to 7 seconds using an 8-core Windows 64 server. Note: not using memory mapping makes things much slower.

    import multiprocessing, sys, time, os, mmap
    import logging, logging.handlers
    
    def init_logger(pid):
        console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
        logger = logging.getLogger()  # New logger at root level
        logger.setLevel( logging.INFO )
        logger.handlers.append( logging.StreamHandler() )
        logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )
    
    def getFileLineCount( queues, pid, processes, file1 ):
        init_logger(pid)
        logging.info( 'start' )
    
        physical_file = open(file1, "r")
        #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]
    
        m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )
    
        #work out file size to divide up line counting
    
        fSize = os.stat(file1).st_size
        chunk = (fSize / processes) + 1
    
        lines = 0
    
        #get where I start and stop
        _seedStart = chunk * (pid)
        _seekEnd = chunk * (pid+1)
        seekStart = int(_seedStart)
        seekEnd = int(_seekEnd)
    
        if seekEnd < int(_seekEnd + 1):
            seekEnd += 1
    
        if _seedStart < int(seekStart + 1):
            seekStart += 1
    
        if seekEnd > fSize:
            seekEnd = fSize
    
        #find where to start
        if pid > 0:
            m1.seek( seekStart )
            #read next line
            l1 = m1.readline()  # need to use readline with memory mapped files
            seekStart = m1.tell()
    
        #tell previous rank my seek start to make their seek end
    
        if pid > 0:
            queues[pid-1].put( seekStart )
        if pid < processes-1:
            seekEnd = queues[pid].get()
    
        m1.seek( seekStart )
        l1 = m1.readline()
    
        while len(l1) > 0:
            lines += 1
            l1 = m1.readline()
            if m1.tell() > seekEnd or len(l1) == 0:
                break
    
        logging.info( 'done' )
        # add up the results
        if pid == 0:
            for p in range(1,processes):
                lines += queues[0].get()
            queues[0].put(lines) # the total lines counted
        else:
            queues[0].put(lines)
    
        m1.close()
        physical_file.close()
    
    if __name__ == '__main__':
        init_logger( 'main' )
        if len(sys.argv) > 1:
            file_name = sys.argv[1]
        else:
            logging.fatal( 'parameters required: file-name [processes]' )
            exit()
    
        t = time.time()
        processes = multiprocessing.cpu_count()
        if len(sys.argv) > 2:
            processes = int(sys.argv[2])
        queues=[] # a queue for each process
        for pid in range(processes):
            queues.append( multiprocessing.Queue() )
        jobs=[]
        prev_pipe = 0
        for pid in range(processes):
            p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
            p.start()
            jobs.append(p)
    
        jobs[0].join() #wait for counting to finish
        lines = queues[0].get()
    
        logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
    
  • 499

    This code is shorter and clearer. It's probably the best way:

    num_lines = open('yourfile.ext').read().count('\n')
    
  • 35

    I got a small (4-8%) improvement with this version, which re-uses a constant buffer, so it should avoid any memory or GC overhead:

    lines = 0
    buffer = bytearray(2048)
    with open(filename, 'rb') as f:  # readinto needs a binary-mode file
        n = f.readinto(buffer)
        while n > 0:
            lines += buffer.count(b'\n', 0, n)  # count only the bytes just read
            n = f.readinto(buffer)
    

    You can play with the buffer size and maybe see a little improvement.
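
    If you want to find a good buffer size for your machine empirically, a rough measurement loop might look like this (the file name and the candidate sizes are illustrative):

    import time

    def count_with_buffer(filename, buf_size):
        lines = 0
        buffer = bytearray(buf_size)
        with open(filename, 'rb') as f:
            n = f.readinto(buffer)
            while n > 0:
                lines += buffer.count(b'\n', 0, n)
                n = f.readinto(buffer)
        return lines

    for buf_size in (2 ** 11, 2 ** 14, 2 ** 16, 2 ** 20):  # 2 KiB up to 1 MiB
        start = time.time()
        count_with_buffer('big_file.txt', buf_size)
        print(buf_size, time.time() - start)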

  • 4

    I figured a memory-mapped file would be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffered read solution posted by Mykola Kharechko (bufcount).

    I ran each function five times, and calculated the average run-time for a 1.2-million-line text file.

    Windows XP, Python 2.5, 2 GB RAM, 2 GHz AMD processor

    Here are my results:

    mapcount : 0.465599966049
    simplecount : 0.756399965286
    bufcount : 0.546800041199
    opcount : 0.718600034714
    

    Edit: figures for Python 2.6:

    mapcount : 0.471799945831
    simplecount : 0.634400033951
    bufcount : 0.468800067902
    opcount : 0.602999973297
    

    So the buffered read strategy seems to be the fastest for Windows/Python 2.6.

    Here's the code:

    from __future__ import with_statement
    import time
    import mmap
    import random
    from collections import defaultdict
    
    def mapcount(filename):
        f = open(filename, "r+")
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines
    
    def simplecount(filename):
        lines = 0
        for line in open(filename):
            lines += 1
        return lines
    
    def bufcount(filename):
        f = open(filename)                  
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
    
        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)
    
        return lines
    
    def opcount(fname):
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1
    
    
    counts = defaultdict(list)
    
    for i in range(5):
        for func in [mapcount, simplecount, bufcount, opcount]:
            start_time = time.time()
            assert func("big_file.txt") == 1209138
            counts[func].append(time.time() - start_time)
    
    for key, vals in counts.items():
        print key.__name__, ":", sum(vals) / float(len(vals))
    
  • 74

    I modified the buffered version like this:

    def CountLines(filename):
        f = open(filename)
        try:
            lines = 1
            buf_size = 1024 * 1024
            read_f = f.read # loop optimization
            buf = read_f(buf_size)
    
            # Empty file
            if not buf:
                return 0
    
            while buf:
                lines += buf.count('\n')
                buf = read_f(buf_size)
    
            return lines
        finally:
            f.close()
    

    Now empty files, and a last line without a trailing \n, are also counted correctly.
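
    A quick check of those two edge cases (the file names here are just illustrative):

    with open('empty.txt', 'w'):
        pass
    with open('no_newline.txt', 'w') as f:
        f.write('one\ntwo\nthree')  # three lines, no trailing newline

    print(CountLines('empty.txt'))       # -> 0
    print(CountLines('no_newline.txt'))  # -> 3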

  • 7

    This is what I use; it seems pretty clean:

    import subprocess
    
    def count_file_lines(file_path):
        """
        Counts the number of lines in a file using wc utility.
        :param file_path: path to file
        :return: int, no of lines
        """
        num = subprocess.check_output(['wc', '-l', file_path])
        return int(num.split()[0])  # split() with no argument also handles the bytes wc returns on Python 3
    

    UPDATE: This is marginally faster than using pure Python, but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

  • 2

    If anyone wants to get the line count cheaply in Python on Linux, I recommend this method:

    import os
    print os.popen("wc -l file_path").readline().split()[0]
    

    file_path can be either an absolute file path or a relative path. Hope this helps.

  • 274

    As for me, this variant will be the fastest:

    #!/usr/bin/env python
    
    def main():
        f = open('filename')                  
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
    
        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)
    
        print lines
    
    if __name__ == '__main__':
        main()
    

    Reason: buffered reading is faster than reading line by line, and string.count is also very fast.

  • 0

    What about this one-liner:

    file_length = len(open('myfile.txt','r').read().split('\n'))
    

    It takes 0.003 seconds to time it this way on a file of 3900 lines:

    def c():
      import time
      s = time.time()
      file_length = len(open('myfile.txt','r').read().split('\n'))
      print time.time() - s
    
  • 1

    How about this?

    import fileinput
    import sys
    
    counter=0
    for line in fileinput.input([sys.argv[1]]):
        counter+=1
    
    fileinput.close()
    print counter
    
  • 0

    You can't get any better than that.

    After all, any solution will have to read the entire file, figure out how many \n characters you have, and return that result.

    Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered. A minimal sketch of that principle is shown below.
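
    The sketch reads fixed-size chunks so memory use stays constant regardless of file size (essentially what the buffered answers elsewhere on this page do; the function name and chunk size are illustrative):

    def count_newlines(path, chunk_size=1024 * 1024):
        """Count newline bytes one chunk at a time; memory use stays bounded."""
        count = 0
        with open(path, 'rb') as f:
            chunk = f.read(chunk_size)
            while chunk:
                count += chunk.count(b'\n')
                chunk = f.read(chunk_size)
        return count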

  • 5

    What about this?

    import itertools

    def file_len(fname):
        counts = itertools.count()
        with open(fname) as f:
            for _ in f: counts.next()
        return counts.next()
    
  • 1

    This is the fastest thing I have found using pure Python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

    from functools import partial
    
    buffer = 2 ** 16
    with open(myfile) as f:
        print sum(x.count('\n') for x in iter(partial(f.read, buffer), ''))
    

    I found the answer here: Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It's a very good read for understanding how to count lines quickly, though wc -l is still about 75% faster than anything else.

  • 2

    One line, probably pretty fast:

    num_lines = sum(1 for line in open('myfile.txt'))
    
  • 8

    I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

    All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)

    Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

    def rawcount(filename):
        f = open(filename, 'rb')
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.raw.read
    
        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)
    
        return lines
    

    Using a separate generator function, this runs a bit faster:

    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024*1024)
    
    def rawgencount(filename):
        f = open(filename, 'rb')
        f_gen = _make_gen(f.raw.read)
        return sum( buf.count(b'\n') for buf in f_gen )
    

    This can be done completely with generator expressions in-line using itertools, but it gets pretty weird-looking:

    from itertools import (takewhile,repeat)
    
    def rawincount(filename):
        f = open(filename, 'rb')
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum( buf.count(b'\n') for buf in bufgen )
    

    Here are my timings:

    function      average, s  min, s   ratio
    rawincount        0.0043  0.0041   1.00
    rawgencount       0.0044  0.0042   1.01
    rawcount          0.0048  0.0045   1.09
    bufcount          0.008   0.0068   1.64
    wccount           0.01    0.0097   2.35
    itercount         0.014   0.014    3.41
    opcount           0.02    0.02     4.83
    kylecount         0.021   0.021    5.05
    simplecount       0.022   0.022    5.25
    mapcount          0.037   0.031    7.46
    
  • 6

    Kyle's answer

    num_lines = sum(1 for line in open('my_file.txt'))
    

    is probably the best; an alternative is:

    num_lines =  len(open('my_file.txt').read().splitlines())
    

    Here is a comparison of the performance of both:

    In [20]: timeit sum(1 for line in open('Charts.ipynb'))
    100000 loops, best of 3: 9.79 µs per loop
    
    In [21]: timeit len(open('Charts.ipynb').read().splitlines())
    100000 loops, best of 3: 12 µs per loop
    
  • 81
    print open('file.txt', 'r').read().count("\n") + 1
    
  • 1

    A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

    import subprocess

    def line_count(file):
        return int(subprocess.check_output('wc -l {}'.format(file), shell=True).split()[0])
    
  • 8
    def count_text_file_lines(path):
        with open(path, 'rt') as file:
            line_count = sum(1 for _line in file)
        return line_count
    
  • 1

    You could execute a subprocess and run wc -l filename:

    import subprocess
    
    def file_len(fname):
        p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                                  stderr=subprocess.PIPE)
        result, err = p.communicate()
        if p.returncode != 0:
            raise IOError(err)
        return int(result.strip().split()[0])
    
  • 3
    def line_count(path):
        count = 0
        with open(path) as lines:
            for count, l in enumerate(lines, start=1):
                pass
        return count
    
  • 1

    Just to complete the methods above, I tried a variant with the fileinput module:

    import fileinput as fi

    def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()
    

    and passed a 60-million-line file to all the methods stated above:

    mapcount : 6.1331050396
    simplecount : 4.588793993
    opcount : 4.42918205261
    filecount : 43.2780818939
    bufcount : 0.170812129974
    

    It's a bit of a surprise to me that fileinput is that bad, and scales far worse than all the other methods...

  • 7

    count = max(enumerate(open(filename)))[0] + 1  # enumerate is zero-based, so add one
