I have a CSV dataset that is very large (larger than affordable RAM) and I am working with it in pandas. Right now I am doing something like this:

df = pd.read_csv('verylargefile.csv', chunksize=10000)
for df_chunk in df:
    df_chunk['new_column'] = df_chunk['old_column'].apply(my_func)
    # do other operations and filters...
    df_chunk.to_csv('processed.csv', mode='a')

so I can perform the operations I need on the dataset and save the output to another file.
The problem starts when I try to do some grouping and statistics on this dataset: I need to compute the mean, the standard deviation and histograms over the whole dataset, plot the results and trends, fit models with statsmodels, and so on. Because the samples in a chunk are not homogeneous, I cannot compute the statistics chunk by chunk:

df.groupby('column_1').sum()
TypeError: Cannot groupby on a TextFileReader

I do not have the usual option of loading only a few columns, and I do not see how storing the data in HDF would help. Is there a way?
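For a simple global mean I could accumulate running totals across the chunks myself; a minimal sketch of that idea (column names are just placeholders):

import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv('verylargefile.csv', chunksize=10000):
    # keep only running totals, never the full dataset
    total += chunk['old_column'].sum()
    count += len(chunk)
mean = total / count

but that does not extend to groupby, histograms or statsmodels fits, so I am looking for something more general.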
Answer 0 (score: 2)
This looks like a case where dask.dataframe could help. I have included an example below. For more details, see the dask documentation and the very good tutorial.
In [1]: import dask.dataframe as dd
In [2]: !head data/accounts.0.csv
id,names,amount
138,Ray,1502
1,Tim,5
388,Ingrid,45
202,Zelda,1324
336,Jerry,-1607
456,Laura,-2832
65,Laura,-326
54,Yvonne,341
92,Sarah,3136
In [3]: dask_df = dd.read_csv('data/accounts.0.csv', chunkbytes=4000000)
In [4]: dask_df.npartitions
Out[4]: 4
In [5]: len(dask_df)
Out[5]: 1000000
In [6]: result = dask_df.groupby('names').sum()
In [7]: result.compute()
Out[7]:
id amount
names
Alice 10282524 43233084
Bob 8617531 47512276
Charlie 8056803 47729638
Dan 10146581 32513817
Edith 15164281 37806024
Frank 11310157 63869156
George 14941235 80436603
Hannah 3006336 25721574
Ingrid 10123877 54152865
Jerry 10317245 8613040
Kevin 6809100 16755317
Laura 9941112 34723055
Michael 11200937 36431387
Norbert 5715799 14482698
Oliver 10423117 32415534
Patricia 15289085 22767501
Quinn 10686459 16083432
Ray 10156429 9455663
Sarah 7977036 34970428
Tim 12283567 47851141
Ursula 4893696 37942347
Victor 8864468 15542688
Wendy 9348077 68824579
Xavier 6600945 -3482124
Yvonne 5665415 12701550
Zelda 8491817 42573021
For comparison, pandas gives the same result on this data. The data I used here fits in memory, but dask will work even when the data is larger than memory.
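A note on the session above: chunkbytes comes from an older dask release; in current dask the corresponding dd.read_csv keyword is blocksize. A minimal sketch of the same computation with the current API (assuming a recent dask version):

import dask.dataframe as dd

# ~4 MB partitions; 'blocksize' is the current name for the old 'chunkbytes'
dask_df = dd.read_csv('data/accounts.0.csv', blocksize=4_000_000)
result = dask_df.groupby('names').sum()
print(result.compute())  # compute() triggers the actual out-of-core work

Only the final compute() reads and aggregates the partitions, so memory use stays bounded by the partition size rather than the file size.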
Answer 1 (score: 1)
The type of df is not DataFrame but TextFileReader. I think you need to concatenate all the chunks into a single DataFrame with pd.concat and then apply your function:
import pandas as pd

df = pd.read_csv('verylargefile.csv', chunksize=10000)  # gives a TextFileReader
df_chunk = pd.concat(df, ignore_index=True)  # build one DataFrame from all chunks
df_chunk['new_column'] = df_chunk['old_column'].apply(my_func)
# do other operations and filters...
df_chunk.to_csv('processed.csv', mode='a')
EDIT: Maybe this approach helps: process the large DataFrame group by group. Example:
import pandas as pd
import numpy as np
import io

# test data
temp = u"""id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,16,13
3,14,15,13
3,14,185,213"""

df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id', 'col1'])
# drop duplicates; from the output you can choose the constants
df = df.drop_duplicates()
print(df)
#    id  col1
# 0   1    13
# 2   1    12
# 3   2    18
# 9   3    14

# for example, a hand-written list of constants
constants = [1, 2, 3]
# or build it from the unique values of column id
constants = df['id'].unique().tolist()
print(constants)
# [1, 2, 3]

for i in constants:
    iter_csv = pd.read_csv(io.StringIO(temp), delimiter=",", chunksize=10)
    # concat the subset of rows where id == constant
    df = pd.concat([chunk[chunk['id'] == i] for chunk in iter_csv])
    # your groupby function
    data = df.reset_index(drop=True).groupby(["id", "col1"], as_index=False).sum()
    print(data.to_csv(index=False))
#id,col1,col2,col3
#1,12,15,13
#1,13,30,28
#
#id,col1,col2,col3
#2,18,90,78
#
#id,col1,col2,col3
#3,14,215,239
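For additive aggregations such as sum, a chunk-wise variant avoids re-reading the file once per group; here is a minimal sketch of that idea, using the column_1 name from the question (this sketch is not part of the original answer):

import pandas as pd

partials = []
for chunk in pd.read_csv('verylargefile.csv', chunksize=10000):
    # aggregate each chunk on its own; only the small partial results are kept
    partials.append(chunk.groupby('column_1').sum())

# concatenate the partial results and reduce by group once more
result = pd.concat(partials).groupby(level=0).sum()
print(result)

This works because a sum of sums equals the overall sum; for non-additive statistics such as the median you would need dask or a two-pass approach.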