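I am trying to create a DataFrame with pandas in Python from my data (scores between chemicals and proteins).
I want my DataFrame to show the most frequently occurring proteins first, so I sorted the data beforehand. But when I create the DataFrame, I don't get the expected result.
Here is a sample of my data: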
chemicals prots scores
CID000000006 10116.ENSRNOP00000003921 196
CID000000051 10116.ENSRNOP00000003921 246
CID000000085 10116.ENSRNOP00000003921 196
CID000000119 10116.ENSRNOP00000003921 247
CID000000134 10116.ENSRNOP00000008952 159
CID000000135 10116.ENSRNOP00000008952 157
CID000000174 10116.ENSRNOP00000008952 439
CID000000175 10116.ENSRNOP00000001021 858
CID000000177 10116.ENSRNOP00000004027 760
As you can see, "10116.ENSRNOP00000003921" is the protein that occurs most often in my data.
So I would expect to get something like this:
              10116.ENSRNOP00000003921  10116.ENSRNOP00000008952
CID000000006                       196
CID000000051                       246
CID000000085                       196
CID000000119                       247
CID000000134                                                 159
CID000000135                                                 157
CID000000174                                                 439
Here is my code:
import pandas as pd
df_rat = pd.read_csv("dt_matrix_rat.csv", sep="\t", header=0)  # header takes a row number, not True
df_rat.columns = ['chemicals','proteins','scores']
df_rat1 = df_rat.pivot(index='chemicals', columns='proteins', values='scores')
df_rat1.to_csv("rat_matrix.csv", sep='\t', index=True )
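To make the behaviour easy to reproduce, here is a minimal sketch that types the sample rows inline (the `rows` list and the `wide` name are illustrative, not part of my real script). It suggests that `pivot` returns the column labels in sorted order, whatever the row order of the input was:

```python
import pandas as pd

# Sample rows from the question, typed inline so the snippet is self-contained.
rows = [
    ("CID000000006", "10116.ENSRNOP00000003921", 196),
    ("CID000000051", "10116.ENSRNOP00000003921", 246),
    ("CID000000085", "10116.ENSRNOP00000003921", 196),
    ("CID000000119", "10116.ENSRNOP00000003921", 247),
    ("CID000000134", "10116.ENSRNOP00000008952", 159),
    ("CID000000135", "10116.ENSRNOP00000008952", 157),
    ("CID000000174", "10116.ENSRNOP00000008952", 439),
    ("CID000000175", "10116.ENSRNOP00000001021", 858),
    ("CID000000177", "10116.ENSRNOP00000004027", 760),
]
df = pd.DataFrame(rows, columns=["chemicals", "proteins", "scores"])

wide = df.pivot(index="chemicals", columns="proteins", values="scores")

# The column labels come back sorted, so the most frequent protein
# (…3921) is not the first column even though its rows come first.
print(list(wide.columns))
```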
Answer 0 (score: 0)
You can use @jezrael's solution, or do it this way (very similar):
In [136]: df
Out[136]:
chemicals prots scores
0 CID000000006 10116.ENSRNOP00000003921 196
1 CID000000051 10116.ENSRNOP00000003921 246
2 CID000000085 10116.ENSRNOP00000003921 196
3 CID000000119 10116.ENSRNOP00000003921 247
4 CID000000134 10116.ENSRNOP00000008952 159
5 CID000000135 10116.ENSRNOP00000008952 157
6 CID000000174 10116.ENSRNOP00000008952 439
7 CID000000175 10116.ENSRNOP00000001021 858
8 CID000000177 10116.ENSRNOP00000004027 760
Prepare the correct order:
In [169]: df.groupby('prots')[['scores']].sum().sort_values(by='scores', ascending=False)
Out[169]:
scores
prots
10116.ENSRNOP00000003921 885
10116.ENSRNOP00000001021 858
10116.ENSRNOP00000004027 760
10116.ENSRNOP00000008952 755
Build the list of sorted columns (on old versions of pandas, use .sort() instead of .sort_values()):
In [170]: cols = df.groupby('prots').sum().sort_values(by='scores', ascending=False).index
In [171]: cols
Out[171]:
Index(['10116.ENSRNOP00000003921', '10116.ENSRNOP00000001021',
'10116.ENSRNOP00000004027', '10116.ENSRNOP00000008952'],
dtype='object', name='prots')
Pivot and put the columns in the correct order:
In [175]: df_rat1 = df.pivot(index='chemicals', columns='prots', values='scores').fillna('')
In [176]: df_rat1 = df_rat1[cols]
In [177]: df_rat1
Out[177]:
prots         10116.ENSRNOP00000003921  10116.ENSRNOP00000001021  10116.ENSRNOP00000004027  10116.ENSRNOP00000008952
chemicals
CID000000006                       196
CID000000051                       246
CID000000085                       196
CID000000119                       247
CID000000134                                                                                                     159
CID000000135                                                                                                     157
CID000000174                                                                                                     439
CID000000175                                                 858
CID000000177                                                                           760
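Note that the groupby/sum step above orders the proteins by total score, not by how often they occur. If ordering by number of occurrences is what you want, a sketch along these lines may help. It assumes the column is named `prots` as in this answer, and the inline `rows` list is only there to make it self-contained:

```python
import pandas as pd

rows = [
    ("CID000000006", "10116.ENSRNOP00000003921", 196),
    ("CID000000051", "10116.ENSRNOP00000003921", 246),
    ("CID000000085", "10116.ENSRNOP00000003921", 196),
    ("CID000000119", "10116.ENSRNOP00000003921", 247),
    ("CID000000134", "10116.ENSRNOP00000008952", 159),
    ("CID000000135", "10116.ENSRNOP00000008952", 157),
    ("CID000000174", "10116.ENSRNOP00000008952", 439),
    ("CID000000175", "10116.ENSRNOP00000001021", 858),
    ("CID000000177", "10116.ENSRNOP00000004027", 760),
]
df = pd.DataFrame(rows, columns=["chemicals", "prots", "scores"])

# value_counts counts the rows per protein and sorts the counts
# in descending order by default.
counts = df["prots"].value_counts()
cols = counts.index

wide = df.pivot(index="chemicals", columns="prots", values="scores")
wide = wide[cols]  # reorder the pivoted columns by frequency

print(counts)
```

Proteins with the same count (here the two that occur once) may come out in either order.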
Answer 1 (score: 0)
I think you need notnull, sum and sort_values, and then take the index as cols. Lastly, select with that subset:
df1 = df.pivot(index='chemicals', columns='proteins', values='scores')
cols = df1.notnull().sum(axis=0).sort_values(ascending=False).index
print(cols)
Index(['10116.ENSRNOP00000003921', '10116.ENSRNOP00000008952',
'10116.ENSRNOP00000004027', '10116.ENSRNOP00000001021'],
dtype='object', name='proteins')
print(df1[cols])
proteins 10116.ENSRNOP00000003921 10116.ENSRNOP00000008952 \
chemicals
CID000000006 196.0 NaN
CID000000051 246.0 NaN
CID000000085 196.0 NaN
CID000000119 247.0 NaN
CID000000134 NaN 159.0
CID000000135 NaN 157.0
CID000000174 NaN 439.0
CID000000175 NaN NaN
CID000000177 NaN NaN
proteins 10116.ENSRNOP00000004027 10116.ENSRNOP00000001021
chemicals
CID000000006 NaN NaN
CID000000051 NaN NaN
CID000000085 NaN NaN
CID000000119 NaN NaN
CID000000134 NaN NaN
CID000000135 NaN NaN
CID000000174 NaN NaN
CID000000175 NaN 858.0
CID000000177 760.0 NaN
print(df1.reindex(columns=cols))  # reindex_axis was removed in newer pandas
proteins 10116.ENSRNOP00000003921 10116.ENSRNOP00000008952 \
chemicals
CID000000006 196.0 NaN
CID000000051 246.0 NaN
CID000000085 196.0 NaN
CID000000119 247.0 NaN
CID000000134 NaN 159.0
CID000000135 NaN 157.0
CID000000174 NaN 439.0
CID000000175 NaN NaN
CID000000177 NaN NaN
proteins 10116.ENSRNOP00000004027 10116.ENSRNOP00000001021
chemicals
CID000000006 NaN NaN
CID000000051 NaN NaN
CID000000085 NaN NaN
CID000000119 NaN NaN
CID000000134 NaN NaN
CID000000135 NaN NaN
CID000000174 NaN NaN
CID000000175 NaN 858.0
CID000000177 760.0 NaN
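Putting the steps above together, the whole thing could be wrapped roughly as follows. This is only a sketch: `pivot_by_frequency` is a hypothetical helper name, the inline `rows` stand in for the real file, and the result is rendered as tab-separated text rather than written to rat_matrix.csv:

```python
import pandas as pd


def pivot_by_frequency(df, index="chemicals", columns="proteins", values="scores"):
    """Pivot long-format rows and order the columns by non-null count."""
    wide = df.pivot(index=index, columns=columns, values=values)
    # Count the non-null cells in each column, most populated first.
    order = wide.notna().sum(axis=0).sort_values(ascending=False).index
    return wide.reindex(columns=order)


rows = [
    ("CID000000006", "10116.ENSRNOP00000003921", 196),
    ("CID000000051", "10116.ENSRNOP00000003921", 246),
    ("CID000000085", "10116.ENSRNOP00000003921", 196),
    ("CID000000119", "10116.ENSRNOP00000003921", 247),
    ("CID000000134", "10116.ENSRNOP00000008952", 159),
    ("CID000000135", "10116.ENSRNOP00000008952", 157),
    ("CID000000174", "10116.ENSRNOP00000008952", 439),
    ("CID000000175", "10116.ENSRNOP00000001021", 858),
    ("CID000000177", "10116.ENSRNOP00000004027", 760),
]
df = pd.DataFrame(rows, columns=["chemicals", "proteins", "scores"])
wide = pivot_by_frequency(df)

# With no path argument, to_csv returns the text instead of writing a file;
# na_rep="" leaves the missing cells empty.
tsv = wide.to_csv(sep="\t", na_rep="")
print(tsv.splitlines()[0])
```

To write the file instead, pass a path as the first argument to `to_csv`, as in the question.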