
Question

To create a document-term matrix, I am using the text file result.txt as input. I am trying to get the word counts in this form:

Counter({'STTP': 6, 'AVENUES': 4, 'ENGINEERING': 4, 'MINING': 4, 'THE': 4, 'SCOE': 4, 'HERE': 4, 'DATA': 4, 'TOOLS': 4, 'PROGRAMMING': 3, 'TEMPERATURE': 3})

But the result I get looks like this instead:

"degree,the,mituski,programming,national,it,high,sakal,engineering,paper,college,signed
1,4,2,3,1,2,1,1,4,1,1,1"

Here is the code I used:

import glob
import textmining

tdm = textmining.TermDocumentMatrix()
files = glob.glob("result.txt")
for f in files:
    content = open(f).read()
    content = content.replace('\n', ' \n')
    tdm.add_doc(content)
    tdm.write_csv('matrix1.csv', cutoff=1)

Answer 1

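The result is a correctly formatted CSV file. The first row is the header (the words), and the second row holds the count of each word.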

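What you show in the question looks like the printed form of a collections.Counter object, that is, a dict passed to the Counter class constructor. It holds the same kind of information, just in a different representation.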
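For comparison, here is a minimal sketch of how output in that Counter form is typically produced with the standard library alone. The helper name count_words and the tokenization rule are assumptions made for illustration; they are not part of the textmining package and may differ from how it splits words.

```python
import re
from collections import Counter


def count_words(text):
    """Count upper-cased word tokens in a string (hypothetical helper)."""
    # Assumption: a word is a run of letters, digits or apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text.upper())
    return Counter(tokens)


counts = count_words("data tools data mining")
print(counts)
```

Printing a Counter shows the class name wrapped around a dict, which is the shape of the first snippet in the question.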

From the Python Textmining Package documentation:

Instead of writing out the matrix, you can also access its rows directly.
# Let's print them to the screen.
for row in tdm.rows(cutoff=1):
    print(row)

So, to get the dict from the question, you can do:

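# rows() yields the header row (the terms) first, then one row of counts per document.
result_rows = list(tdm.rows(cutoff=1))
# Pair each term with its count in the first (here, the only) document.
result_dict = dict(zip(result_rows[0], result_rows[1]))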
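If more than one document has been added, the rows contain one count row per document, and the snippet above only uses the first. The following is a rough sketch of summing the counts over all documents into a Counter. The function name rows_to_counter is hypothetical, and the sample rows are a shortened copy of the CSV output in the question, so the sketch runs without the textmining package installed.

```python
from collections import Counter


def rows_to_counter(rows):
    """Sum per-document count rows into one Counter (hypothetical helper).

    Assumes the first row is the header of terms and every following
    row holds the counts for one document, in the same column order.
    """
    rows = list(rows)
    header, doc_rows = rows[0], rows[1:]
    total = Counter()
    for doc_row in doc_rows:
        for term, count in zip(header, doc_row):
            # int() also accepts counts read back from a CSV file as strings.
            total[term] += int(count)
    return total


# Shortened sample taken from the CSV output shown in the question.
sample_rows = [
    ["degree", "the", "mituski", "programming"],
    [1, 4, 2, 3],
]
result = rows_to_counter(sample_rows)
print(result.most_common(2))
```

With a real matrix, the same function could be called as rows_to_counter(tdm.rows(cutoff=1)).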
