Pandas groupby sum using two DataFrames

Time: 2016-08-29 05:55:22

Tags: python pandas optimization dataframe sum

I have two very large Pandas DataFrames and would like to use one to guide a fast sum operation over the other. The two frames look like this:

Frame1:

SampleName  Gene1   Gene2   Gene3
Sample1         1       2       3
Sample2         4       5       6
Sample3         7       8       9

(In reality, Frame1 is about 1,000 rows x ~300,000 columns)

Frame2:

FeatureName GeneID
Feature1    Gene1
Feature1    Gene3
Feature2    Gene1
Feature2    Gene2
Feature2    Gene3

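(In reality, Frame2 is about 350,000 rows x 2 columns, with about 17,000 unique features)

I'd like to sum the columns of Frame1 according to the gene groups defined in Frame2. For example, the output for the two frames above would be: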

SampleName  Feature1    Feature2
Sample1            4           6
Sample2           10          15
Sample3           16          24

(In reality, the output would be ~1,000 rows x 17,000 columns)

Is there a way to do this with minimal memory usage?
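For anyone who wants to reproduce this, here is a minimal sketch that builds the toy frames above and hand-checks one cell of the expected output. The variable names `frame1`, `frame2`, and `feature1_sample1` are chosen for illustration only.

```python
import pandas as pd

# Toy versions of the two frames shown above.
frame1 = pd.DataFrame({
    "SampleName": ["Sample1", "Sample2", "Sample3"],
    "Gene1": [1, 4, 7],
    "Gene2": [2, 5, 8],
    "Gene3": [3, 6, 9],
})
frame2 = pd.DataFrame({
    "FeatureName": ["Feature1", "Feature1", "Feature2", "Feature2", "Feature2"],
    "GeneID": ["Gene1", "Gene3", "Gene1", "Gene2", "Gene3"],
})

# Hand-check one cell: Feature1 for Sample1 is the sum of its genes.
genes = frame2.loc[frame2["FeatureName"] == "Feature1", "GeneID"].tolist()
sample1 = frame1.set_index("SampleName").loc["Sample1"]
feature1_sample1 = sample1.loc[genes].sum()  # Gene1 + Gene3
print(feature1_sample1)
```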

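3 answers:

Answer 0 (score: 3)

If you want to reduce memory usage, I think your best option is to iterate over the rows of the first DataFrame, since it only has about 1k rows.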

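import pandas as pd

dfs = []
frame1 = frame1.set_index('SampleName')
for idx, row in frame1.iterrows():
    # row is named after the sample, so the joined column is called idx;
    # selecting it keeps the string GeneID column out of the sum
    dfs.append(frame2.join(row, on='GeneID').groupby('FeatureName')[idx].sum())
pd.concat(dfs, axis=1).T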

This yields:

FeatureName  Feature1  Feature2
Sample1             4         6
Sample2            10        15
Sample3            16        24
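As an end-to-end check of the row-by-row idea, here is a small sketch on the toy data from the question. The function name `sum_by_feature_rowwise` is hypothetical, and the frames are rebuilt here only so the block runs on its own. Each iteration holds an intermediate with one row per row of `frame2`, which is why peak memory stays modest.

```python
import pandas as pd

def sum_by_feature_rowwise(frame1, frame2):
    """Sum gene columns per feature, one sample (row) at a time."""
    wide = frame1.set_index("SampleName")
    per_sample = []
    for sample, row in wide.iterrows():
        # row is a Series of gene values named after the sample; joining on
        # GeneID attaches each gene's value to its feature rows.
        joined = frame2.join(row, on="GeneID")
        per_sample.append(joined.groupby("FeatureName")[sample].sum())
    # concat gives one column per sample; transpose to samples x features.
    return pd.concat(per_sample, axis=1).T

frame1 = pd.DataFrame({
    "SampleName": ["Sample1", "Sample2", "Sample3"],
    "Gene1": [1, 4, 7], "Gene2": [2, 5, 8], "Gene3": [3, 6, 9],
})
frame2 = pd.DataFrame({
    "FeatureName": ["Feature1", "Feature1", "Feature2", "Feature2", "Feature2"],
    "GeneID": ["Gene1", "Gene3", "Gene1", "Gene2", "Gene3"],
})

result = sum_by_feature_rowwise(frame1, frame2)
print(result)
```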

Answer 1 (score: 2)

You can first build a MultiIndex with MultiIndex.from_tuples, then reindex the columns against it, and finally groupby:

#create MultiIndex from df2
cols = pd.MultiIndex.from_tuples(list(zip(df2.FeatureName, df2.GeneID)),
       names=('FeatureName','GeneID'))
print (cols)
MultiIndex(levels=[['Feature1', 'Feature2'], ['Gene1', 'Gene2', 'Gene3']],
           labels=[[0, 0, 1, 1, 1], [0, 2, 0, 1, 2]],
           names=['FeatureName', 'GeneID'])

#reindex columns by MultiIndex           
df = df1.set_index('SampleName').reindex(columns=cols, level=1)
print (df)
FeatureName Feature1       Feature2            
GeneID         Gene1 Gene3    Gene1 Gene2 Gene3
SampleName                                     
Sample1            1     3        1     2     3
Sample2            4     6        4     5     6
Sample3            7     9        7     8     9

#group the columns by level 0 (FeatureName) and sum; transposing avoids
#groupby(axis=1), which is deprecated in recent pandas versions
print (df.T.groupby(level=0).sum().T)
FeatureName  Feature1  Feature2
SampleName                     
Sample1             4         6
Sample2            10        15
Sample3            16        24
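The same column-expansion idea can be sketched as a self-contained script on the toy data. Note two substitutions of my own: plain column selection with a list of gene names stands in for `reindex(columns=cols, level=1)`, and the groupby runs on the transposed frame. The names `expanded` and `result` are illustrative.

```python
import pandas as pd

frame1 = pd.DataFrame({
    "SampleName": ["Sample1", "Sample2", "Sample3"],
    "Gene1": [1, 4, 7], "Gene2": [2, 5, 8], "Gene3": [3, 6, 9],
})
frame2 = pd.DataFrame({
    "FeatureName": ["Feature1", "Feature1", "Feature2", "Feature2", "Feature2"],
    "GeneID": ["Gene1", "Gene3", "Gene1", "Gene2", "Gene3"],
})

wide = frame1.set_index("SampleName")

# Select one column per frame2 row (genes shared by several features are
# duplicated), then label the columns with (FeatureName, GeneID) pairs.
expanded = wide[frame2["GeneID"].tolist()]
expanded.columns = pd.MultiIndex.from_frame(frame2[["FeatureName", "GeneID"]])

# Group the former columns by feature and sum, then transpose back.
result = expanded.T.groupby(level="FeatureName").sum().T
print(result)
```

The expanded frame has one column per row of the mapping table, so at the sizes quoted in the question it would be roughly 1,000 x 350,000 and has to fit in memory alongside the original frame.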

Answer 2 (score: 1)

A nasty one-liner:

Frame1.set_index('SampleName') \
    .rename_axis('GeneID', axis=1) \
    .stack().rename('Value') \
    .reset_index().merge(Frame2) \
    .groupby(['SampleName', 'FeatureName']) \
    .Value.sum().unstack()
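To see what the chain does at each step, here is a sketch that splits it up on the toy data and prints the size of each intermediate. The names `long_df` and `merged` are illustrative.

```python
import pandas as pd

frame1 = pd.DataFrame({
    "SampleName": ["Sample1", "Sample2", "Sample3"],
    "Gene1": [1, 4, 7], "Gene2": [2, 5, 8], "Gene3": [3, 6, 9],
})
frame2 = pd.DataFrame({
    "FeatureName": ["Feature1", "Feature1", "Feature2", "Feature2", "Feature2"],
    "GeneID": ["Gene1", "Gene3", "Gene1", "Gene2", "Gene3"],
})

# Wide -> long: one row per (sample, gene) pair.
long_df = (
    frame1.set_index("SampleName")
    .rename_axis("GeneID", axis=1)
    .stack()
    .rename("Value")
    .reset_index()
)
print(len(long_df))  # 3 samples x 3 genes

# Attach features: one row per (sample, frame2 row) pair.
merged = long_df.merge(frame2, on="GeneID")
print(len(merged))  # 3 samples x 5 mapping rows

result = merged.groupby(["SampleName", "FeatureName"])["Value"].sum().unstack()
print(result)
```

At the sizes quoted in the question, the long frame would have about 1,000 x 300,000 = 300 million rows, and the merged frame about 350 million if every GeneID in Frame2 appears in Frame1, so this approach trades memory for brevity.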
