I have an 8GB csv file and 8GB of RAM. The file contains two strings per line, in this form:
a,c
c,a
f,g
a,c
c,a
b,f
c,a
For smaller files, I remove the duplicates, count the number of copies of each row in the first two columns, and then recode the strings to integers as follows:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv("file.txt", header=None, names=["ID_0", "ID_1"])
# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)
# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
This gives:
   ID_0  ID_1  count
0     0     1      2
1     1     0      3
2     2     4      1
3     4     3      1
which is exactly what I need for this toy example.
For the larger file, I cannot take these steps because I run out of RAM.
I can imagine combining unix sort with a bespoke python solution that makes multiple passes over the data. But it has been suggested that dask might be suitable. Having read the documentation, I am still not clear on this.
Can dask be used for this kind of out-of-core processing, or is there some other out-of-core pandas solution?
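As a side note, the unix-sort idea mentioned above can be sketched in a couple of shell commands (assuming the file is named file.txt as in the snippet above; this handles only the pair counting, and the integer recoding would still happen in python):

```shell
# Sort the pair lines so duplicates become adjacent, then let uniq count
# each distinct pair. sort spills to temporary files on disk, so this
# stays within the RAM budget even for an 8GB input.
sort file.txt | uniq -c | sort -rn > pair_counts.txt
```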
Answer 0 (score: 2)
Assuming the grouped dataframe fits into your memory, the changes you have to make to your code should be fairly small. Here is my attempt:
import pandas as pd
from dask import dataframe as dd
from sklearn.preprocessing import LabelEncoder
# Import the data as a dask dataframe, 100MB per partition.
# Note that no data is read at this point; dask reads the files lazily
# once compute() is called.
df = dd.read_csv("file.txt", header=None, prefix="ID_", blocksize=100000000)
# Perform the groupby (before converting letters to digits).
# For better understanding, let's split this into two parts:
# (i) define the groupby aggregation on the dask dataframe and call compute()
# (ii) compute() returns a plain pandas object we can keep working with
pandas_df = df.groupby(['ID_0', 'ID_1']).size().compute()
pandas_df = pandas_df.rename('count').reset_index()
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(pandas_df[['ID_0', 'ID_1']].values.flat)
# Convert to digits.
pandas_df[['ID_0', 'ID_1']] = pandas_df[['ID_0', 'ID_1']].apply(le.transform)
A possible solution in pandas is to read the file in chunks (pass the chunksize parameter to read_csv), run the groupby on the individual chunks, and combine the results.
Here is how the problem can be solved in pure python:
import pandas as pd

counts = {}
with open('data') as fp:
    for line in fp:
        id1, id2 = line.rstrip().split(',')
        counts[(id1, id2)] = 1 + counts.get((id1, id2), 0)

df = pd.DataFrame(data=[(k[0], k[1], v) for k, v in counts.items()],
                  columns=['ID_0', 'ID_1', 'count'])
# apply label encoding etc.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)
# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)