Question

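I used Latent Dirichlet Allocation to build a topic model with 20 topics for more than 5000 txt documents. I now have a .csv file with three columns: the document number, the topic number, and the probability of that topic in the document. It looks like this (for documents n°1 and n°2):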

1   1   0,113
1   4   0,2
1   7   0,156
1   17  0,065
1   18  0,463
2   1   0,44
2   6   0,207
2   14  0,103
2   16  0,126
2   17  0,015
2   18  0,106

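Basically, I would like to get the list of documents for which a given topic has the highest topic probability.

I think I have to do the following:

1) For each identical value in column 1 (call it doc_number), get the highest value in column 3 (call it highest_prob).

2) For each doc_number obtained (there should be as many as there are documents), get the corresponding topic number in column 2 (call it topic_number).

3) Return the list of doc_number values associated with the specific topic_number I am interested in.

I am new to Python and don't know how to proceed with the csv package or pandas...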

Answer 1

You can first replace the comma with a dot in the probability column and then cast it to float with astype. Then groupby on the document_number column and use idxmax to get the index of the maximum probability in each group. Finally, get the full rows with loc:

import pandas as pd

df = pd.DataFrame({'document_number': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2},
                   'probability': {0: '0,113', 1: '0,2', 2: '0,156', 3: '0,065', 4: '0,463', 5: '0,44', 6: '0,207', 7: '0,103', 8: '0,126', 9: '0,015', 10: '0,106'},
                   'topic_number': {0: 1, 1: 4, 2: 7, 3: 17, 4: 18, 5: 1, 6: 6, 7: 14, 8: 16, 9: 17, 10: 18}},
                  columns = ['document_number','topic_number','probability'])

print (df)
    document_number  topic_number probability
0                 1             1       0,113
1                 1             4         0,2
2                 1             7       0,156
3                 1            17       0,065
4                 1            18       0,463
5                 2             1        0,44
6                 2             6       0,207
7                 2            14       0,103
8                 2            16       0,126
9                 2            17       0,015
10                2            18       0,106

df['probability'] = df.probability.str.replace(',','.').astype(float)

print (df.groupby('document_number')['probability'].idxmax())
1    4
2    5
Name: probability, dtype: int64

print (df.loc[df.groupby('document_number')['probability'].idxmax()])
   document_number  topic_number  probability
4                1            18        0.463
5                2             1        0.440

Finally, set_index on the document_number column, select the topic_number column, and convert it with to_dict:

print (df.loc[df.groupby('document_number')['probability'].idxmax()]
         .set_index('document_number')['topic_number'])
document_number
1    18
2     1
Name: topic_number, dtype: int64

print (df.loc[df.groupby('document_number')['probability'].idxmax()]
         .set_index('document_number')['topic_number'].to_dict())
{1: 18, 2: 1}

Another solution is to first sort_values on the probability column, then groupby and aggregate with first:

print (df.sort_values(by="probability", ascending=False)
         .groupby('document_number', as_index=False)
         .first())
   document_number  topic_number  probability
0                1            18        0.463
1                2             1        0.440

print (df.sort_values(by="probability", ascending=False)
         .groupby('document_number', as_index=False)
         .first().set_index('document_number')['topic_number'])
document_number
1    18
2     1
Name: topic_number, dtype: int64

print (df.sort_values(by="probability", ascending=False)
         .groupby('document_number', as_index=False)
         .first().set_index('document_number')['topic_number'].to_dict())
{1: 18, 2: 1}
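To finish step 3 of the question (the list of documents whose highest-probability topic is a given topic), the rows selected with idxmax can be filtered on topic_number. This is only a minimal sketch, assuming the probability column has already been converted to float; the helper name docs_for_topic and the small sample frame are illustrative and not part of the original answer:

```python
import pandas as pd

# Small sample in the same shape as the question's data (probabilities already floats).
df = pd.DataFrame({
    'document_number': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'topic_number':    [1, 4, 7, 17, 18, 1, 6, 14, 16, 17, 18],
    'probability':     [0.113, 0.2, 0.156, 0.065, 0.463,
                        0.44, 0.207, 0.103, 0.126, 0.015, 0.106],
})

def docs_for_topic(frame, topic):
    """Hypothetical helper: documents whose highest-probability topic is `topic`."""
    # One row per document: the row holding that document's maximum probability.
    top = frame.loc[frame.groupby('document_number')['probability'].idxmax()]
    # Keep only the documents whose dominant topic matches the requested one.
    return top.loc[top['topic_number'] == topic, 'document_number'].tolist()

print(docs_for_topic(df, 18))
print(docs_for_topic(df, 1))
```

If two topics tie for the maximum within a document, idxmax keeps the first occurrence, so that document is reported under only one of them.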

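Since the data actually lives in a .csv file, the comma decimals can also be handled while loading, which makes the separate replace and astype step unnecessary. A sketch under assumptions: the file is taken to be tab-separated with no header row (the question does not say), and an in-memory string stands in for the real file:

```python
import io
import pandas as pd

# Stand-in for the real file; the tab separator and missing header row are assumptions.
raw = "1\t1\t0,113\n1\t4\t0,2\n1\t18\t0,463\n2\t1\t0,44\n2\t6\t0,207\n"

df_csv = pd.read_csv(
    io.StringIO(raw),   # replace with the path to the actual .csv file
    sep='\t',
    header=None,
    names=['document_number', 'topic_number', 'probability'],
    decimal=',',        # parse "0,113" as the float 0.113
)
print(df_csv.dtypes)
```

With the column loaded as float, the groupby and idxmax steps apply unchanged.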

Selecting csv rows with the highest column value among rows sharing the same value in Python
