Question

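I want to use Python to create a unigram and bigram count matrix from a text file, together with a class variable, and write it to a csv. The text file contains two columns, as shown below: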

Text                                                  Class
I love the movie                                      Pos
I hate the movie                                      Neg

I want unigram and bigram counts for the Text column, and the output should be written to a csv file:

I     hate      love        movie   the        class
1     0         1           1       1          Pos
1     1         0           1       1          Neg

Bigrams

I love     love the     the movie     I hate    hate the         class
1            1              1         0          0               Pos
0            0              1         1          1               Neg

Can someone help me improve the code below so that it produces the output format shown above?

>>> import nltk
>>> from collections import Counter
>>> fo = open("text.txt")
>>> fo1 = fo.readlines()
>>> for line in fo1:
       bigm = list(nltk.bigrams(line.split()))
       bigmC = Counter(bigm)
       for key, value in bigmC.items():
           print(key, value)

('love', 'the') 1
('the', 'movie') 1
('I', 'love') 1
('I', 'hate') 1
('hate', 'the') 1
('the', 'movie') 1

Answer 1

I have made your input file a bit more elaborate, so you can be confident that the solution works:

I love the movie movie
I hate the movie
The movie was rubbish
The movie was fantastic

The first line contains one word twice; otherwise you could not tell whether the counter is actually counting.
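A minimal sketch of why the repeated word matters. It uses plain str.split and zip instead of the NLTK helpers, which is an assumption made only to keep the snippet free of dependencies:

```python
from collections import Counter

line = "I love the movie movie"
tokens = line.split()

# Unigram counts: the repeated word shows up with a count of 2.
unigram_counts = Counter(tokens)

# Bigram counts: pair each token with its successor.
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(unigram_counts["movie"])            # 2
print(bigram_counts[("movie", "movie")])  # 1
```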

Solution:

import csv
import nltk
from collections import Counter

# Read all lines once; they are traversed twice below.
with open("text.txt") as fo:
    fo1 = fo.readlines()

# First pass: collect the whole 'population' of words and bigrams in your document.
counter_sum = Counter()
for line in fo1:
    tokens = nltk.word_tokenize(line)
    bigrams = list(nltk.bigrams(line.split()))
    counter_sum += Counter(bigrams) + Counter(tokens)

# Second pass: now that we have the population, we can write the csv.
with open('unigrams_and_bigrams.csv', 'w', newline='') as csvfile:
    # Sorting by type name puts unigrams (str) before bigrams (tuple).
    header = sorted(counter_sum, key=lambda x: str(type(x)))
    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    for line in fo1:
        tokens = nltk.word_tokenize(line)
        bigrams = list(nltk.bigrams(line.split()))
        both_counters = Counter(bigrams) + Counter(tokens)
        # A Counter returns 0 for missing keys, so absent n-grams get a zero.
        row = {element: both_counters[element] for element in counter_sum}
        writer.writerow(row)

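So, I used and built on your initial approach. You did not say whether you wanted the bigrams and unigrams in separate csv files, so I assumed you wanted them together. It would not be too hard for you to reprogram it otherwise. Accumulating the population this way is probably better done with tools already built into NLP libraries, but it is interesting to see that it can be done at a lower level. I am using Python 3, by the way; if you need this to work in Python 2, you may have to change a few things, such as the use of list.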

Some interesting references I used: this one on summing counters was new to me. Also, I had to ask a question to get your bigrams and unigrams grouped at separate ends of the CSV.
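The two techniques mentioned here, summing counters and grouping the keys by type, can be sketched in isolation. The sample tokens are made up for illustration, and the snippet relies only on collections.Counter from the standard library. Note that Counter addition keeps only keys with positive counts.

```python
from collections import Counter

unigrams = Counter(["I", "love", "the", "movie", "movie"])
bigrams = Counter([("I", "love"), ("love", "the"),
                   ("the", "movie"), ("movie", "movie")])

# Adding two Counters merges their keys and adds the counts of shared keys.
combined = unigrams + bigrams

# += accumulates into an existing Counter, as in the loop over lines.
total = Counter()
total += combined
total += Counter(["I", "hate", "the", "movie"])

# Sorting by the type name places str keys (unigrams) before tuple keys
# (bigrams); sorted() is stable, so keys of one type keep insertion order.
header = sorted(total, key=lambda key: str(type(key)))
print(header)
```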

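I know the code looks repetitive, but you have to run through all the lines once to collect the csv header before you can start writing any rows.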

[Image: the output opened in LibreOffice]

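Your csv will collect all the unigrams and bigrams. If you really want the bigrams in the header without brackets and commas, you can write a small function that does that. It is probably better to leave them as tuples, though, in case you need to parse them back into Python at some point, and they are just as readable...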
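One possible shape for such a function, as a sketch only. The name flatten_key is hypothetical, and it assumes no unigram contains a space, since otherwise a flattened bigram could collide with a unigram:

```python
def flatten_key(key):
    """Return a tuple key such as ('I', 'love') as 'I love'; leave strings as-is."""
    return " ".join(key) if isinstance(key, tuple) else key

header = ["I", "love", ("I", "love"), ("love", "the")]
flat_header = [flatten_key(key) for key in header]
print(flat_header)  # ['I', 'love', 'I love', 'love the']

# The row dictionaries need the same treatment so their keys match the header.
row = {"I": 1, "love": 1, ("I", "love"): 1, ("love", "the"): 1}
flat_row = {flatten_key(key): value for key, value in row.items()}
```

Both the header and every row would have to be flattened before being handed to csv.DictWriter.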

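You did not include the code that generates the class column. Assuming you have it, you can append the string 'Class' to the header before the header is written to the csv to create that column, and to populate it, add

row['Class'] = sentiment

on the line just before the row is written.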
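Putting the pieces together for the two-column input from the question, here is a sketch under stated assumptions: the class label is taken to be the last whitespace-separated token of each line, str.split and zip stand in for the NLTK tokenizers, and the names ngram_counts and write_matrix are hypothetical.

```python
import csv
import io
from collections import Counter

def ngram_counts(tokens):
    """Count unigrams (str keys) and bigrams (tuple keys) for one line."""
    return Counter(tokens) + Counter(zip(tokens, tokens[1:]))

def write_matrix(lines, csvfile):
    """Write one row of n-gram counts plus a 'Class' column per input line."""
    parsed = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        *tokens, label = parts  # assumption: the last token is the class label
        parsed.append((ngram_counts(tokens), label))

    # Population of every n-gram seen anywhere, unigrams before bigrams.
    population = Counter()
    for counts, _ in parsed:
        population += counts
    header = sorted(population, key=lambda key: str(type(key))) + ["Class"]

    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    for counts, label in parsed:
        row = {key: counts[key] for key in population}  # missing keys count as 0
        row["Class"] = label
        writer.writerow(row)

# Sample data from the question, without the 'Text  Class' header line.
sample = ["I love the movie    Pos", "I hate the movie    Neg"]
buffer = io.StringIO()
write_matrix(sample, buffer)
print(buffer.getvalue())
```

With a real file, open the output with newline='' and drop the 'Text  Class' header line before passing the lines in, since the sketch would otherwise treat it as a data row.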
