I have two very large pandas DataFrames and would like to use one of them to drive a fast summation over the other. The two frames look like this:
Frame1:
SampleName Gene1 Gene2 Gene3
Sample1 1 2 3
Sample2 4 5 6
Sample3 7 8 9
(In reality, Frame1 is about 1,000 rows x ~300,000 columns)
Frame2:
FeatureName GeneID
Feature1 Gene1
Feature1 Gene3
Feature2 Gene1
Feature2 Gene2
Feature2 Gene3
(In reality, Frame2 is about 350,000 rows x 2 columns, with roughly 17,000 unique features)
I want to sum the columns of Frame1 using the gene groups from Frame2. For example, the output for the two frames above would be:
SampleName Feature1 Feature2
Sample1 4 6
Sample2 10 15
Sample3 16 24
(In reality, the output would be ~1,000 rows x ~17,000 columns)
Is there a way to do this with minimal memory usage?
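For concreteness, the small example frames above can be reproduced like this (a sketch; the variable names frame1 and frame2 are assumptions, chosen to match the first answer below):
import pandas as pd

# Frame1: samples x genes (the real one is ~1,000 x ~300,000)
frame1 = pd.DataFrame({
    'SampleName': ['Sample1', 'Sample2', 'Sample3'],
    'Gene1': [1, 4, 7],
    'Gene2': [2, 5, 8],
    'Gene3': [3, 6, 9],
})

# Frame2: feature -> gene mapping (the real one is ~350,000 x 2)
frame2 = pd.DataFrame({
    'FeatureName': ['Feature1', 'Feature1', 'Feature2', 'Feature2', 'Feature2'],
    'GeneID': ['Gene1', 'Gene3', 'Gene1', 'Gene2', 'Gene3'],
})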
Answer 0 (score: 3)
If you want to reduce memory usage, I think you are best off iterating over the first DataFrame, since it only has about 1,000 rows.
import pandas as pd

dfs = []
frame1 = frame1.set_index('SampleName')
for idx, row in frame1.iterrows():
    # look up each gene's value for this sample via GeneID, then sum per feature
    dfs.append(frame2.join(row, on='GeneID').groupby('FeatureName').sum())
pd.concat(dfs, axis=1).T
which yields
FeatureName Feature1 Feature2
Sample1 4 6
Sample2 10 15
Sample3 16 24
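To make the joining step concrete, here is what a single pass of the loop produces for one sample (a sketch based on the example frames above; the helper names row, joined and per_feature are just for illustration):
# A single pass of the loop, shown for Sample1
# (assumes frame1 has already been re-indexed by SampleName, as in the snippet above).
row = frame1.loc['Sample1']             # Series: Gene1 -> 1, Gene2 -> 2, Gene3 -> 3
joined = frame2.join(row, on='GeneID')  # adds a 'Sample1' column, looked up by GeneID
per_feature = joined.groupby('FeatureName')['Sample1'].sum()
print(per_feature)
# FeatureName
# Feature1    4
# Feature2    6
# Name: Sample1, dtype: int64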
Answer 1 (score: 2)
You can first build a MultiIndex with MultiIndex.from_tuples, then reindex the columns against it, and finally groupby on the first level:
#create MultiIndex from df2
cols = pd.MultiIndex.from_tuples(list(zip(df2.FeatureName, df2.GeneID)),
names=('FeatureName','GeneID'))
print (cols)
MultiIndex(levels=[['Feature1', 'Feature2'], ['Gene1', 'Gene2', 'Gene3']],
labels=[[0, 0, 1, 1, 1], [0, 2, 0, 1, 2]],
names=['FeatureName', 'GeneID'])
#reindex columns by MultiIndex
df = df1.set_index('SampleName').reindex(columns=cols, level=1)
print (df)
FeatureName Feature1 Feature2
GeneID Gene1 Gene3 Gene1 Gene2 Gene3
SampleName
Sample1 1 3 1 2 3
Sample2 4 6 4 5 6
Sample3 7 9 7 8 9
#groupby by level 0 of columns and aggregate sum
print (df.groupby(level=0, axis=1).sum())
FeatureName Feature1 Feature2
SampleName
Sample1 4 6
Sample2 10 15
Sample3 16 24
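One caveat worth hedging: recent pandas releases deprecate passing axis=1 to groupby. If you hit that warning, the same per-feature sum can be obtained by transposing first, sketched below with the reindexed df from this answer:
# Same aggregation as df.groupby(level=0, axis=1).sum(), written without axis=1:
# transpose so (FeatureName, GeneID) becomes the row index,
# sum per FeatureName, then transpose back.
result = df.T.groupby(level=0).sum().T
print(result)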
Answer 2 (score: 1)