Question

I am working with large CSV files. Because of memory constraints I cannot load an entire file into a single DataFrame, so I process the data in chunks.

df = pd.read_csv(filepath, chunksize = chunksize)
for chunk in df:
    print(chunk['col2'].describe())

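This gives me the statistics for each chunk. Is there a way to merge the results of each chunk.describe() call so that I get the statistics for all of the data at once?

The only approach I can think of right now is to maintain a dictionary that stores the statistics and update it on each iteration.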
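
For illustration, the dictionary idea could look roughly like the sketch below. It is only a sketch under assumptions: the helper name running_describe and the small in-memory CSV are made up for the example, and it tracks just count, sum, sum of squares, min and max, which is enough to recover the exact count, mean, std, min and max but not the quartiles.

```python
import io
import math

import pandas as pd


def running_describe(chunks, column):
    """Accumulate exact count/mean/std/min/max for one column across chunks.

    Hypothetical helper for illustration; not part of pandas.
    """
    stats = {"count": 0, "sum": 0.0, "sum_sq": 0.0,
             "min": math.inf, "max": -math.inf}
    for chunk in chunks:
        values = chunk[column].dropna()  # describe() also ignores NaN
        if values.empty:
            continue
        stats["count"] += len(values)
        stats["sum"] += float(values.sum())
        stats["sum_sq"] += float((values ** 2).sum())
        stats["min"] = min(stats["min"], float(values.min()))
        stats["max"] = max(stats["max"], float(values.max()))

    n = stats["count"]
    mean = stats["sum"] / n if n else math.nan
    if n > 1:
        # Sample variance (ddof=1), the convention describe() uses
        var = (stats["sum_sq"] - n * mean ** 2) / (n - 1)
        std = math.sqrt(max(var, 0.0))
    else:
        std = math.nan
    return pd.Series({"count": n, "mean": mean, "std": std,
                      "min": stats["min"], "max": stats["max"]})


# Small in-memory CSV standing in for the large file (assumed data)
csv_text = "col1,col2\n" + "\n".join(f"{i},{i * 1.5}" for i in range(20))
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=6)
result = running_describe(chunks, "col2")
print(result)
```

The sum-of-squares formula is simple but can lose precision when values are large relative to their spread; a Welford-style update would be a more numerically careful alternative.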

Answer 1

Edited:

I needed to practice this a bit myself. I'm new to this, so take it with a grain of salt:

Load a sample from a remote source:

import pandas as pd

df1_iter = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv", 
                       chunksize=5, 
                       iterator=True)

Use a simple for loop to call .describe() and .T on each chunk and append the result to a list.

Then call pd.concat() on df_list.

df_list = []

for chunk in df1_iter:
    df_list.append(chunk.describe().T)

df_concat = pd.concat(df_list)

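Groupby
For agg, I used the functions I thought were useful; adjust them as needed. Keep in mind that this combines per-chunk summaries, so the results are approximate: the mean of the chunk means equals the overall mean only when every chunk has the same number of rows, 'std': 'std' gives the spread of the per-chunk standard deviations rather than the overall standard deviation, and the averaged quartiles are only estimates.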

desc_df = df_concat.groupby(df_concat.index).agg(
    {
        'mean':'mean', 
        'std': 'std',
        'min': 'min',
        '25%': 'mean', 
        '50%': 'mean', 
        '75%': 'mean', 
        'max': 'max'
    }
)

print(desc_df)

            mean        std     min         25%         50%         75%      max
am      0.433333   0.223607   0.000    0.333333    0.500000    0.500000    1.000
carb    3.100000   1.293135   1.000    2.250000    2.666667    4.083333    8.000
cyl     6.200000   0.636339   4.000    5.500000    6.000000    7.166667    8.000
disp  232.336667  40.954447  71.100  177.216667  195.233333  281.966667  472.000
drat    3.622833   0.161794   2.760    3.340417    3.649167    3.849583    4.930
gear    3.783333   0.239882   3.000    3.541667    3.916667    3.958333    5.000
hp    158.733333  44.053017  52.000  124.416667  139.333333  191.083333  335.000
mpg    19.753333   2.968229  10.400   16.583333   20.950000   23.133333   33.900
qsec   17.747000   0.868257  14.500   16.948333   17.808333   18.248333   22.900
vs      0.450000   0.102315   0.000    0.208333    0.416667    0.625000    1.000
wt      3.266900   0.598493   1.513    2.850417    3.042500    3.809583    5.424
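
If exact values are needed, the count, mean and std that describe() already reports for each chunk are enough to reconstruct the overall count, mean and std. The following is a minimal sketch for a single column, not a definitive implementation: combine_describes is a hypothetical helper name and the sample data is invented. It weights each chunk mean by its count and adds the within-chunk and between-chunk sums of squares to get the overall variance. Quartiles are left out because they cannot be recovered exactly from per-chunk quartiles.

```python
import io

import numpy as np
import pandas as pd


def combine_describes(describes):
    """Combine per-chunk Series.describe() outputs into overall statistics.

    Hypothetical helper for illustration; not part of pandas.
    """
    d = pd.DataFrame(describes).reset_index(drop=True)  # one row per chunk
    n = d["count"]
    total = n.sum()
    # Count-weighted mean of the chunk means
    mean = (n * d["mean"]).sum() / total
    # Within-chunk sum of squares; std is NaN for one-row chunks, so use 0
    within = ((n - 1) * d["std"].fillna(0.0) ** 2).sum()
    # Between-chunk sum of squares around the overall mean
    between = (n * (d["mean"] - mean) ** 2).sum()
    std = np.sqrt((within + between) / (total - 1)) if total > 1 else np.nan
    return pd.Series({"count": total, "mean": mean, "std": std,
                      "min": d["min"].min(), "max": d["max"].max()})


# Invented sample data; chunksize=4 yields chunks of 4, 4 and 3 rows
csv_text = "col2\n" + "\n".join(str(x) for x in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
parts = [chunk["col2"].describe()
         for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]
combined = combine_describes(parts)
print(combined)
```

Because the chunks here have unequal sizes, a plain average of the chunk means would differ from the overall mean, while the count-weighted version matches what describe() returns on the full column.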

I hope this helps.
