I have the output of a MapReduce job stored in a text file, in the following format:
Book Title:Token Count
Book1:Word1 5
Book2:Word1 6
Book1:Word2 2
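I am trying to convert it into a data frame that lists all unique words in the first column and, in the adjacent columns, the number of times each word appears in each document.
So far, I have the following code, which splits each line of the text file into the book title, token, and count: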
for line in unigrams:
    token, count = line.strip().split("\t")
    document = token.split(":")[0]
    word = token.split(":")[1]
    x[i] = {'Document': document, 'Word': word.strip(), 'Count': count.strip()}
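The dictionary x is initialized outside the for loop, and i is incremented at the bottom of the loop. The following line then converts the dictionary x into a data frame: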
df = pd.DataFrame.from_dict(x, orient="index")
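Any guidance on how to modify the code above to get this result would be greatly appreciated. Thanks in advance.
Answer 0 (score: 0)
Pandas provides a convenient pivot operation for your use case.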
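import pandas as pd

# One record per (document, word) pair, as produced by the parsing loop
x = [{"Document": "Doc 1", "Word": "Word 1", "Count": 3},
     {"Document": "Doc 2", "Word": "Word 2", "Count": 1},
     {"Document": "Doc 3", "Word": "Word 3", "Count": 2},
     {"Document": "Doc 3", "Word": "Word 1", "Count": 6},
     {"Document": "Doc 1", "Word": "Word 2", "Count": 1},
     {"Document": "Doc 2", "Word": "Word 3", "Count": 7}]
df = pd.DataFrame(x)
# Sum the counts in case a (word, document) pair appears more than once
df = df.groupby(["Word", "Document"]).sum().reset_index()
# One row per word, one column per document; values="Count" keeps the column labels flat
df.pivot(index="Word", columns="Document", values="Count")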
This has the added benefit of using your Document and Word values as indexes for fast access.
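For what it's worth, here is a minimal end-to-end sketch that ties the parsing loop from the question to the pivot step. It assumes the count is separated from the token by a tab, as the split("\t") in the question suggests, and that the count should be converted to an integer so it can be summed. It uses pivot_table with fill_value=0 so that a word missing from a document shows 0 rather than NaN. The names parse_line, records, and table are illustrative and not from the original post.

```python
import pandas as pd

# Sample lines in the question's format. The tab before the count is an
# assumption based on the split("\t") call in the question.
lines = [
    "Book1:Word1\t5",
    "Book2:Word1\t6",
    "Book1:Word2\t2",
]


def parse_line(line):
    """Split one 'Document:Word<TAB>Count' line into a record (hypothetical helper)."""
    token, count = line.strip().split("\t")
    document, word = token.split(":", 1)
    # Convert the count to int; leaving it as a string would break summing.
    return {"Document": document, "Word": word.strip(), "Count": int(count)}


# Build a list of records instead of a dict keyed by a manual counter.
records = [parse_line(line) for line in lines]
df = pd.DataFrame(records)

# One row per word, one column per document. Duplicate (word, document)
# pairs are summed, and missing combinations are filled with 0.
table = df.pivot_table(index="Word", columns="Document", values="Count",
                       aggfunc="sum", fill_value=0)
print(table)
```

Building a list of records avoids having to maintain the counter i by hand, and pivot_table handles both the aggregation and the reshaping in one call, so the separate groupby step is not needed in this variant.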