Working with a 10 GB dataset in Python Pandas

I have a very large .csv (originally from a SAS dataset) with the following columns:

target_series  metric_series        month   metric_1  metric_2  target_metric
1              1                    1       #float    #float    #float
1              1                    2       #float    #float    #float
...
1              1                    60      #float    #float    #float
1              2                    1       #float    #float    #float
1              2                    2       #float    #float    #float
...
1              80000                60      #float    #float    #float
2              1                    1       #float    #float    #float
...
50             80000                60      #float    #float    #float

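As you can see, the file has 60 months × 80,000 independent series × 50 target series rows (240 million in total), and it takes up over 10 GB when saved as a .csv. What I need to do is compute and record the correlation of each of metric_1 and metric_2 with target_metric.

I wrote the following code: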

import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv")  # approximately 10 GB
output = []

for target_num in range(1, 51):  # target_series runs from 1 to 50 inclusive
    for metric_number in range(1, 80001):  # metric_series runs from 1 to 80000 inclusive
        startTime = datetime.now()  # Begin the timer
        current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]  # Select the current 60-month period that we want to perform the correlation on
        print('The selection took: ' + str(datetime.now() - startTime) + ' (hours:minutes:seconds)')  # Stop the timer
        results_amount_target = current_df[['metric_1', 'target_metric']].corr()  # Perform metric_1 correlation
        results_count_target = current_df[['metric_2', 'target_metric']].corr()  # Perform metric_2 correlation

        output.append([target_num, metric_number, results_amount_target.iat[0, 1], results_count_target.iat[0, 1]])  # Record the correlations in a Python list; it will be converted to a DataFrame later

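The reason I have the datetime code in there is to find out why this process takes so long. The timer is wrapped around the current_df line, which is by far the slowest (I moved the datetime calls around to find out what was taking so long).

I found that selecting part of the data with this line of code: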

current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]

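takes 1.5 seconds each time. With 50 × 80,000 = 4 million pairs to select, that adds up to roughly 69 days. That is far too slow for me! Clearly something needs to change!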
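Part of the cost is that each boolean mask compares every row of the frame against the two keys, and the loop repeats that once per pair. A minimal sketch of one alternative, assuming the frame fits in memory: let groupby split the frame by (target_series, metric_series) once and correlate within each group. The function name correlations_by_group and the output column names are made up for illustration; the tiny frame stands in for data.csv.

```python
import numpy as np
import pandas as pd


def correlations_by_group(data):
    """Split the frame once by (target_series, metric_series), then
    correlate metric_1 and metric_2 with target_metric in each group."""
    rows = []
    grouped = data.groupby(["target_series", "metric_series"], sort=False)
    for (target_num, metric_number), group in grouped:
        rows.append([
            target_num,
            metric_number,
            group["metric_1"].corr(group["target_metric"]),
            group["metric_2"].corr(group["target_metric"]),
        ])
    return pd.DataFrame(
        rows,
        columns=["target_series", "metric_series", "corr_metric_1", "corr_metric_2"],
    )


# Small synthetic stand-in for the real file: 2 targets x 3 series x 60 months.
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [range(1, 3), range(1, 4), range(1, 61)],
    names=["target_series", "metric_series", "month"],
)
demo = pd.DataFrame(
    rng.normal(size=(len(index), 3)),
    index=index,
    columns=["metric_1", "metric_2", "target_metric"],
).reset_index()

print(correlations_by_group(demo))
```

This still requires the whole file to be loaded, so it addresses the per-pair selection cost rather than the memory limit.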

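I decided to try a different approach. Since I know I want to go through the dataset 60 rows at a time (one block for each target_series and metric_series pair), I tried the following two things:

Read in the first 60 rows from data.csv, perform the correlation, then read in the next 60 rows from data.csv, using code like data = pd.read_csv('data.csv', nrows=60,skiprows=60). While this is faster for the first part of the dataset, it becomes unbearably slow because I have to keep skipping through the dataset. Reading the last 60 rows of the dataset took more than 10 minutes on my PC!
Read the first 60 rows of the DataFrame stored in memory with data.head(60), then remove that data from the DataFrame with data = data.drop(data.head(60).index), but this was even slower!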
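The first approach slows down because every read_csv call starts again from the top of the file and has to scan past all the skipped rows. A minimal sketch of reading the same 60-row blocks in a single pass, by keeping one reader open and asking it for the next block each time. The helper name read_blocks is hypothetical, and the sketch assumes each pair occupies exactly 60 consecutive rows.

```python
import io

import pandas as pd


def read_blocks(csv_source, rows_per_block=60):
    """Yield consecutive blocks of rows from one open CSV reader,
    so earlier rows are never re-scanned."""
    reader = pd.read_csv(csv_source, iterator=True)
    try:
        while True:
            try:
                yield reader.get_chunk(rows_per_block)
            except StopIteration:
                # The reader signals the end of the file this way.
                return
    finally:
        reader.close()


# Tiny in-memory stand-in for data.csv, read 3 rows at a time.
sample = io.StringIO(
    "target_series,metric_series,month,metric_1,metric_2,target_metric\n"
    "1,1,1,0.1,0.9,1.0\n"
    "1,1,2,0.2,0.8,2.0\n"
    "1,1,3,0.3,0.7,3.0\n"
    "1,2,1,0.5,0.1,1.5\n"
    "1,2,2,0.4,0.2,2.5\n"
    "1,2,3,0.6,0.3,3.5\n"
)
for block in read_blocks(sample, rows_per_block=3):
    print(block[["target_series", "metric_series"]].iloc[0].tolist(), len(block))
```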

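At this point, I am exploring using HDFStore or h5py to move the dataset from .csv to .h5, but I am unsure how to proceed. The computer I am doing this analysis on has only 16 GB of memory, and in the future I can expect to work with data even larger than this file.

What is the best way to solve this problem, and how can I prepare to work with even larger data in Python / Pandas?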

UPDATE

Thanks to filmor, I have rewritten my code as follows:

import pandas as pd

data = pd.read_csv("data.csv", chunksize=60)  # data is now an iterable of 60-row DataFrames
output = []

for chunk in data:
    results_amount_target = chunk[['metric_1', 'target_metric']].corr()  # Perform metric_1 correlation
    results_count_target = chunk[['metric_2', 'target_metric']].corr()  # Perform metric_2 correlation

    # The row index keeps counting across chunks, so take the first row by position with .iloc[0]
    output.append([chunk['target_series'].iloc[0], chunk['metric_series'].iloc[0], results_amount_target.iat[0, 1], results_count_target.iat[0, 1]])  # Record the correlations in a Python list; it will be converted to a DataFrame later

This is now super fast and memory-light! I would still appreciate it if someone could show me how to do this with HDFStore or .h5 files.
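For the HDFStore route, here is a rough sketch rather than a tested recipe for a file of this size. It assumes the PyTables package (tables) is installed, which pandas needs for HDF5 support. The key name "data", the helper names csv_to_hdf and read_pair, and the chunk size are all illustrative choices. The idea is to append the CSV to a table-format store chunk by chunk, declaring the two series columns as data_columns so they can be filtered on disk, and then pull out one pair at a time with a where clause.

```python
import pandas as pd

KEY = "data"  # hypothetical key name for the table inside the .h5 file


def csv_to_hdf(csv_source, h5_path, chunksize=600_000):
    """Copy a CSV into an HDF5 table store without loading it all at once."""
    with pd.HDFStore(h5_path, mode="w") as store:
        for chunk in pd.read_csv(csv_source, chunksize=chunksize):
            # data_columns makes these columns usable in `where` queries;
            # index=False defers index creation until all rows are written.
            store.append(
                KEY,
                chunk,
                data_columns=["target_series", "metric_series"],
                index=False,
            )
        # Build the on-disk index once, after the final append.
        store.create_table_index(
            KEY, columns=["target_series", "metric_series"], kind="full"
        )


def read_pair(h5_path, target_num, metric_number):
    """Load only the rows for one (target_series, metric_series) pair."""
    where = (
        f"target_series == {int(target_num)} & "
        f"metric_series == {int(metric_number)}"
    )
    return pd.read_hdf(h5_path, KEY, where=where)
```

A call such as csv_to_hdf("data.csv", "data.h5") would be run once, after which read_pair("data.h5", 1, 2) returns the 60 rows for that pair. Whether per-pair queries beat the chunked CSV loop above would need to be measured on the real data.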
