I have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'category': ['ctr', 'ctr', 'ctr', 'ctr', 'ctr', 'ctr'],
    'expected_count': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})
which produces:
In [2]: df
Out[2]:
  category  expected_count gene_symbol sample_id
0      ctr           100.0           a        S1
1      ctr           100.0           b        S1
2      ctr           112.0           c        S1
3      ctr             1.3           a        S2
4      ctr            14.0           b        S2
5      ctr           125.0           c        S2
I have no problem grouping by gene symbol:
In [4]: gdf = df.groupby(by = 'gene_symbol')['expected_count'].mean()
...: gdf
...:
Out[4]:
gene_symbol
a     50.65
b     57.00
c    118.50
Name: expected_count, dtype: float64
In [5]: str(gdf)
Out[5]: 'gene_symbol\na 50.65\nb 57.00\nc 118.50\nName: expected_count, dtype: float64'
Note that gdf is a string. How can I convert it into a DataFrame?
Answer 0 (score: 1)
You need either as_index=False in the groupby call or reset_index on the result:
gdf = df.groupby('gene_symbol', as_index=False)['expected_count'].mean()
print(gdf)
  gene_symbol  expected_count
0           a           50.65
1           b           57.00
2           c          118.50
Or:
gdf = df.groupby('gene_symbol')['expected_count'].mean().reset_index()
print(gdf)
  gene_symbol  expected_count
0           a           50.65
1           b           57.00
2           c          118.50
Also, the groupby output is not a string but a Series. Calling str(gdf) only returns its printed representation:
print(type(df.groupby('gene_symbol')['expected_count'].mean()))
<class 'pandas.core.series.Series'>
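To check this yourself, here is a minimal sketch that rebuilds the question's data and shows that str() produces a separate string while gdf stays a Series. The variable name text is only illustrative and does not appear in the question.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'category': ['ctr'] * 6,
    'expected_count': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})

gdf = df.groupby('gene_symbol')['expected_count'].mean()

# str() builds a printable representation and returns it;
# it does not modify or rebind gdf.
text = str(gdf)  # 'text' is an illustrative name

print(isinstance(text, str))       # True
print(isinstance(gdf, pd.Series))  # True
```

So there is nothing to convert back from a string. The Series is still available and can be turned into a DataFrame directly.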
Answer 1 (score: 1)
You can use to_frame:
gdf = df.groupby(by = 'gene_symbol')['expected_count'].mean().to_frame()
gdf
Out[149]:
             expected_count
gene_symbol
a                     50.65
b                     57.00
c                    118.50
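Note that to_frame keeps gene_symbol as the index, whereas reset_index moves it into a regular column. The following sketch compares the two on the question's data. The names means, framed, and flat are illustrative only.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'category': ['ctr'] * 6,
    'expected_count': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})

means = df.groupby('gene_symbol')['expected_count'].mean()

# to_frame: one-column DataFrame, gene_symbol stays as the index
framed = means.to_frame()
print(list(framed.columns))  # ['expected_count']
print(framed.index.name)     # gene_symbol

# reset_index: gene_symbol becomes an ordinary column with a 0..n-1 index
flat = means.reset_index()
print(list(flat.columns))    # ['gene_symbol', 'expected_count']
```

Which form is preferable depends on what comes next: the indexed version is convenient for label lookups such as framed.loc['a'], while the flat version is usually easier to merge or export.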