Generating a column of top-ranked values in Pandas

Time: 2015-11-06 21:48:08

Tags: python pandas

I have a data frame, topic_data, that contains the output of an LDA topic model:

topic_data.head(15)

    topic                      word     score
0       0                Automobile  0.063986
1       0                   Vehicle  0.017457
2       0                Horsepower  0.015675
3       0                    Engine  0.014857
4       0                   Bicycle  0.013919
5       1                     Sport  0.032938
6       1      Association_football  0.025324
7       1                Basketball  0.020949
8       1                  Baseball  0.016935
9       1  National_Football_League  0.016597
10      2                     Japan  0.051454
11      2                      Beer  0.032839
12      2                   Alcohol  0.027909
13      2                     Drink  0.019494
14      2                     Vodka  0.017908

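This shows the top 5 terms for each topic, along with each term's score (weight). What I am trying to do is reshape it so that the index is the term's rank, the columns are the topic IDs, and the values are formatted strings built from the word and score columns ("%s (%.02f)" % (word, score)). This means the new data frame should look like this: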

Topic  0                1                            ...
Rank
  0  Automobile (0.06)  Sport (0.03)                 ...
  1  Vehicle (0.02)     Association_football (0.03)  ...
 ... ...                ...                          ...

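What is the right way to do this? I think it involves some combination of setting the index, unstacking, and ranking, but I'm not sure of the correct approach.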
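For reference, the label format described in the question can be sketched on its own. This is only an illustration: the name `sample` and its three rows are assumptions (copied from the first rows of topic_data), and the `label` column name is hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample copied from the first rows of topic_data.
sample = pd.DataFrame({
    "topic": [0, 0, 1],
    "word": ["Automobile", "Vehicle", "Sport"],
    "score": [0.063986, 0.017457, 0.032938],
})

# Apply the question's "%s (%.02f)" template to each (word, score) pair.
sample["label"] = [
    "%s (%.02f)" % (word, score)
    for word, score in zip(sample["word"], sample["score"])
]

print(sample["label"].tolist())
```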

1 Answer:

Answer 0: (score: 2)

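Something like the following should work (df here is the question's topic_data); note that Rank has to be generated first: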

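In [140]:
# Rank words within each topic by descending score (0 = highest)
df['Rank']    = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
# Build the "word (score)" label with two decimal places
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2           = df.sort_values(['Rank', 'topic'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', columns='topic', values='New_str'))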

topic                  0                                1               2
Rank                                                                     
0      Automobile (0.06)                     Sport (0.03)    Japan (0.05)
1         Vehicle (0.02)      Association_football (0.03)     Beer (0.03)
2      Horsepower (0.02)                Basketball (0.02)  Alcohol (0.03)
3          Engine (0.01)                  Baseball (0.02)    Drink (0.02)
4         Bicycle (0.01)  National_Football_League (0.02)    Vodka (0.02)
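
The question's guess about combining index setting and unstacking can also be made to work. The following is a minimal sketch, not part of the original answer: it assumes a small hand-built sample of topic_data (deliberately shuffled within each topic), and the names `ordered`, `label`, and `table` are hypothetical. It sorts by score, numbers the rows within each topic with `cumcount`, and then unstacks the topic level into columns.

```python
import pandas as pd

# Sample rows taken from the question's topic_data (two topics, three words each),
# listed out of order to show that the ranking does not rely on the input order.
topic_data = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1, 1],
    "word": ["Vehicle", "Automobile", "Horsepower",
             "Basketball", "Sport", "Association_football"],
    "score": [0.017457, 0.063986, 0.015675,
              0.020949, 0.032938, 0.025324],
})

# Put the highest score first within each topic, then number the rows per topic.
ordered = topic_data.sort_values(["topic", "score"], ascending=[True, False])
ordered["Rank"] = ordered.groupby("topic").cumcount()

# Build the "word (score)" label with two decimal places.
ordered["label"] = ordered["word"] + ordered["score"].map(" ({:.2f})".format)

# Move Rank and topic into the index, then unstack topic into columns.
table = ordered.set_index(["Rank", "topic"])["label"].unstack("topic")
print(table)
```

Compared with `pivot`, this route keeps the intermediate (Rank, topic) index around, which may be convenient if further per-topic processing is needed before reshaping.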