Read and reverse data chunk by chunk from a CSV file and copy it to a new CSV file

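Suppose I am dealing with a very large CSV file, so I can only read the data into memory chunk by chunk. The expected flow of events is as follows:

1) Read a chunk of data (e.g. 10 rows) from the CSV using pandas.
2) Reverse the order of the data.
3) Copy each row to a new CSV file in reverse, so each chunk (10 rows) is written to the CSV in reversed order.

In the end, the CSV file should be in reversed order, and this should be done on Windows without loading the entire file into memory.

I am trying to do time series forecasting, and I need the data ordered from oldest to newest (oldest entry in the first row). I can't load the entire file into memory, so I'm looking for a way to do this one chunk at a time, if possible.

I tried this on train.csv from the Rossmann dataset on Kaggle. You can get it from this github repo.

My attempt does not copy the rows into the new CSV file properly.

Shown below is my code: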

import pandas as pd
import csv

def reverse():

    fields = ["Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday",
              "SchoolHoliday"]
    with open('processed_train.csv', mode='a') as stock_file:
        writer = csv.writer(stock_file,delimiter=',', quotechar='"', 
                                                quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)

    for chunk in pd.read_csv("train.csv", chunksize=10):
        store_data = chunk.reindex(index=chunk.index[::-1])
        append_data_csv(store_data)

def append_data_csv(store_data):
    with open('processed_train.csv', mode='a') as store_file:
        writer = csv.writer(store_file,delimiter=',', quotechar='"',
                                           quoting=csv.QUOTE_MINIMAL)
        for index, row in store_data.iterrows():
            print(row)
            writer.writerow([row['Store'],row['DayOfWeek'],row['Date'],row['Sales'],
            row['Customers'],row['Open'],row['Promo'],
            row['StateHoliday'],row['SchoolHoliday']])

reverse()

Thanks in advance.

5 Answers

3
With bash, you can tail the whole file except the first line, then reverse it and store it like this:
```
tail -n +2 train.csv  | tac > train_rev.csv
```
If you want to keep the header in the reversed file, write it first and then append the reversed contents:
```
head -1 train.csv > train_rev.csv; tail -n +2 train.csv  | tac >> train_rev.csv
```
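
Since the question mentions Windows, where `tac` is usually not available, here is a rough Python sketch of the same idea (header first, then the remaining lines in reverse). It is not a port of `tac`: it assumes one record per line (no quoted newlines inside fields), and it keeps one integer offset per line in memory instead of the lines themselves. The function name `reverse_csv_by_offsets` is made up for this illustration.

```python
def reverse_csv_by_offsets(src, dst):
    """Copy src to dst with the header first and the data lines reversed.

    Assumption: every record is exactly one physical line.
    """
    offsets = []  # byte offset where each data line starts
    with open(src, 'rb') as fin:
        header = fin.readline()
        pos = fin.tell()
        line = fin.readline()
        while line:
            offsets.append(pos)
            pos = fin.tell()
            line = fin.readline()
        with open(dst, 'wb') as fout:
            fout.write(header)
            # Second pass: jump to each recorded offset, last line first
            for off in reversed(offsets):
                fin.seek(off)
                line = fin.readline()
                # The final line of the input may lack a line terminator
                if not line.endswith(b'\n'):
                    line += b'\n'
                fout.write(line)
```

Calling `reverse_csv_by_offsets('train.csv', 'train_rev.csv')` would mirror the second command above. Working in binary mode keeps the recorded offsets exact regardless of encoding or line-ending style.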
Answered 2024-04-25T17:30:58+08:00
0

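You have repeated blocks of code, and you aren't taking advantage of pandas at all.

What @sujay kumar pointed out is very correct; I would read that more closely.

The file size doesn't matter at all. I use OHLCV tick data in the GBs without problems. If you use pandas.read_csv(), you don't need to do chunked transfer. Sure, it will take time, but it will work fine. Unless you are going into terabytes; I haven't tested that.

When you read_csv(), you don't specify any index. If you did, you could call sort_index() with or without ascending=False, depending on the order you need.

pandas can write CSV too, so please use it for that. I'm pasting some sample code for you to put together.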

df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date", usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()

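Note: if you stick with pandas and its functions, you will be running code that is already optimized, with no need to load the whole file into memory. It's so easy it's almost like cheating :)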
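
To put the pieces above together, here is a minimal sketch of the read, sort by index, write sequence. The sample data and the helper name `sort_oldest_first` are invented for illustration. Note that calling `read_csv` without `chunksize` builds the full DataFrame in memory, so this only applies when the file actually fits in RAM. `kind="mergesort"` is chosen because it is a stable sort, which should keep rows that share a date in their original relative order.

```python
import io

import pandas as pd

# Hypothetical sample standing in for train.csv (newest rows first)
SAMPLE = """Store,Date,Sales
1,2015-07-31,5263
2,2015-07-31,6064
1,2015-07-30,5020
2,2015-07-30,5567
"""

def sort_oldest_first(csv_source):
    """Read a CSV with Date as the index and return it sorted oldest first."""
    df = pd.read_csv(csv_source, parse_dates=["Date"], index_col="Date")
    return df.sort_index(ascending=True, kind="mergesort")

df_sorted = sort_oldest_first(io.StringIO(SAMPLE))
# to_csv() with no path returns the CSV text; pass a filename to write a file
csv_text = df_sorted.to_csv()
```

For the real file you would pass the path, e.g. `sort_oldest_first("train.csv").to_csv("processed_train.csv")`.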

Answered 2024-04-25T17:30:58+08:00

-3

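If you have enough disk space, you can read the file in chunks, reverse each chunk, and store it. Then pick up the stored chunks in reverse order and write them to a new CSV file.

Below is an example with pandas which also uses pickle (for performance) and gzip (for storage efficiency).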

import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03}.pkl.gzip', compression='gzip')

# open the output in write mode and write the chunks in reverse order
# idx holds the index of the last pickle file written
# newline='' keeps the csv writer from emitting blank lines on Windows
with open('out.csv', 'w', newline='') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03}.pkl.gzip', compression='gzip')
        # write the header only once, with the first chunk written
        chunk_pkl.to_csv(fout, index=False, header=(i == idx))

# read new file to check results
df_new = pd.read_csv('out.csv')

print(df_new)

    0   1   2   3   4
0  40  41  42  43  44
1  35  36  37  38  39
2  30  31  32  33  34
3  25  26  27  28  29
4  20  21  22  23  24
5  15  16  17  18  19
6  10  11  12  13  14
7   5   6   7   8   9
8   0   1   2   3   4

Answered 2024-04-25T17:30:58+08:00

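This does exactly what you're asking for, but without pandas. It reads intest.csv line by line (rather than reading the whole file into RAM). It does most of its processing through the file system, using a series of chunk files that are aggregated into the outtest.csv file at the end. If you change maxLines, you can trade off the number of chunk files produced against the amount of RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if it is set to False, the entire file is reversed, including the first line.

For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive on a 6MB csv test file, and I thought something had gone wrong because it returned almost immediately, so it is fast even on slower hardware. It imports only one standard Python library function (remove), so it is very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data. For that use case, pandas would be the best solution for reading the chunks.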

from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True

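    fileCounter = 0
    lineCounter = 0
    # Default so an empty input file does not leave headerLine undefined
    headerLine = ''
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            # Make sure the header ends with a newline even in a header-only file
            headerLine = line if line.endswith('\n') else line + '\n'
            line = infp.readline()
        while (line):
            # The last line may lack a trailing newline; add one so it does
            # not merge with the line that follows it in the reversed output
            if not line.endswith('\n'):
                line += '\n'
            lineCounter += 1
            reverseString = line + reverseString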
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
    # Write any leftovers to a chunk file
    if (lineCounter != 0):
        fileCounter += 1
        writechunk(fileCounter,reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()
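
For the limitation mentioned above (newlines inside the data), one possible variant is to let the standard csv module split the records instead of readline(), since it understands quoted fields. This is only a sketch of the same chunk-file idea, not a tested replacement; the names `reverse_csv_records` and `_write_chunk` are made up for the example, and the output is re-serialized by csv.writer, so quoting and line endings may differ from the input.

```python
import csv
import os
import shutil
import tempfile

def _write_chunk(tmp_dir, number, rows):
    """Write one chunk of rows to a temporary file, last row first."""
    path = os.path.join(tmp_dir, f'chunk{number}.csv')
    with open(path, 'w', newline='') as fp:
        csv.writer(fp).writerows(reversed(rows))
    return path

def reverse_csv_records(in_path, out_path, max_rows=10, keep_header=True):
    """Reverse record order while holding at most max_rows rows in memory."""
    chunk_paths = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        # newline='' lets the csv module handle line endings itself
        with open(in_path, newline='') as infp:
            reader = csv.reader(infp)
            header = next(reader, None) if keep_header else None
            chunk = []
            for row in reader:
                chunk.append(row)
                if len(chunk) == max_rows:
                    chunk_paths.append(_write_chunk(tmp_dir, len(chunk_paths), chunk))
                    chunk = []
            # Write any leftover rows as a final, shorter chunk
            if chunk:
                chunk_paths.append(_write_chunk(tmp_dir, len(chunk_paths), chunk))
        with open(out_path, 'w', newline='') as outfp:
            if header is not None:
                csv.writer(outfp).writerow(header)
            # Append the chunk files from last to first
            for path in reversed(chunk_paths):
                with open(path, newline='') as chunkfp:
                    shutil.copyfileobj(chunkfp, outfp)
```

The temporary directory is removed automatically when the `with` block exits, so no chunk files are left behind even if an error occurs midway.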

Answered 2024-04-25T17:30:58+08:00

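I wouldn't recommend using pandas to parse or stream any file, since you are only introducing extra overhead. The best way is to read the file from the bottom up. Well, a large part of this code actually comes from here; it takes a file and returns its lines in reverse from a generator, which I believe is what you want.

All I did was test it with the train.csv file from the link you provided and write the results to a new file.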

import os

def reverse_readline(filename, buf_size=8192):
    """a generator that returns the lines of a file in reverse order"""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # the first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # if the previous chunk starts right from the beginning of line
                # do not concact the segment to the last line of new chunk
                # instead, yield the segment first 
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')

with open('rev_train.csv','w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))

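It basically reads the file backwards until it finds a newline, and then yields each line from the bottom of the file to the top. It's a pretty interesting way of doing it.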
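
One thing to be aware of: reversing every line also moves the header row to the bottom of the output. If you want the header to stay on top, a variant along these lines might work. It is a sketch, not the code from the linked answer: it reads in binary mode (so seek offsets are plain byte positions), assumes one record per line, and the names `reverse_lines` and `reverse_csv_keep_header` are invented here.

```python
import os

def reverse_lines(path, buf_size=8192):
    """Yield the lines of a file as bytes (without b'\\n'), last line first."""
    with open(path, 'rb') as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        if pos == 0:
            return  # empty file: nothing to yield
        tail = b''  # start of a line whose beginning lies in an earlier block
        first_block = True
        while pos > 0:
            read_size = min(buf_size, pos)
            pos -= read_size
            fh.seek(pos)
            block = fh.read(read_size)
            if first_block:
                first_block = False
                # Drop the file's final newline so it does not produce
                # a spurious empty line at the start of the output
                if block.endswith(b'\n'):
                    block = block[:-1]
            lines = (block + tail).split(b'\n')
            tail = lines[0]  # possibly incomplete; finished by the next read
            for line in reversed(lines[1:]):
                yield line
        yield tail  # the first line of the file

def reverse_csv_keep_header(src, dst, buf_size=8192):
    """Write src to dst with the data lines reversed and the header on top."""
    with open(src, 'rb') as fh:
        header = fh.readline()
    with open(dst, 'wb') as out:
        if not header:
            return
        out.write(header if header.endswith(b'\n') else header + b'\n')
        rows = reverse_lines(src, buf_size)
        pending = next(rows, None)
        for line in rows:
            out.write(pending + b'\n')
            pending = line
        # pending now holds the original first line (the header),
        # which was already written, so it is skipped here
```

Usage would be `reverse_csv_keep_header('train.csv', 'rev_train.csv')`. Only one buffer plus one partial line is held in memory at a time.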

Answered 2024-04-25T17:30:58+08:00
