Count word frequencies and write them to an output file

Posted: 2016-05-23 09:48:15

Tags: python-3.x nltk

From file_test.txt I need to count, using the nltk.FreqDist() function, how many times each word appears in the file. While counting the word frequencies, I also need to check whether each word is in pos_dict.txt; if it is, I multiply the word's frequency by the number associated with that same word in pos_dict.txt.

file_test.txt looks like this:

  abandon, abandon, calm, clear
and so on for these words.

pos_dict.txt looks like this:

"abandon":2,"calm":2,"clear":1,...

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import nltk

f_input_pos=open('file_test.txt','r').read()

def features_pos(dat):
    tokens = nltk.word_tokenize(dat)
    fdist=nltk.FreqDist(tokens)

    f_pos_dict=open('pos_dict.txt','r').read()
    f=f_pos_dict.split(',') 

    for part in f:
        b=part.split(':')
        c=b[-1]   #to catch the number
        T2 = eval(str(c).replace("'","")) # convert number from string to int

        for word in fdist:
            if word in f_pos_dict:
               d=fdist[word]
               print(word,'->',d*T2)


features_pos(f_input_pos)

So my output should look like this:

abandon->4
calm->2
clear->1

But my output repeats everything and the multiplications are clearly wrong. I'm a bit stuck and can't see where the mistake is; I'm probably using the for loops incorrectly. I'd be grateful if someone could help :)

1 answer:

Answer 0 (score: 1)

First, you can read the pos_dict.txt file quickly by treating its contents as the string representation of a dictionary:

alvas@ubi:~$ echo '"abandon":2,"calm":2,"clear":1' > pos_dict.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> pos_dict['abandon']
2
>>> pos_dict['clear']
1
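
As a side note, eval() will execute arbitrary code found in the file. A safer alternative (my suggestion, not part of the original answer) is ast.literal_eval, which only accepts Python literals:

```python
import ast
import io

# Recreate the sample pos_dict.txt from the question
with io.open('pos_dict.txt', 'w') as fout:
    fout.write(u'"abandon":2,"calm":2,"clear":1')

# literal_eval raises ValueError on anything that is not a plain
# literal, so a malicious pos_dict.txt cannot execute code
with io.open('pos_dict.txt', 'r') as fin:
    pos_dict = ast.literal_eval('{' + fin.read().strip() + '}')
```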

Next, to read your file_test.txt, we have to read the file, strip the leading and trailing whitespace, and then split the words on ', ' (a comma followed by a space).

Then, using a collections.Counter object, we can easily get the token counts (see also Difference between Python's collections.Counter and nltk.probability.FreqDist):

alvas@ubi:~$ echo 'abandon, abandon, calm, clear' > file_test.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> Counter(tokens)
Counter({u'abandon': 2, u'clear': 1, u'calm': 1})

To access the token counts from file_test.txt and multiply them by the values in pos_dict.txt, we iterate over the Counter object with the .items() function (just like how we access a dictionary's key-value pairs):

>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> 
>>> word_counts = Counter(tokens)
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> token_times_posdict = {word:freq*pos_dict[word] for word, freq in Counter(tokens).items()}
>>> token_times_posdict
{u'abandon': 4, u'clear': 1, u'calm': 2}

Then to print it out:

>>> for word, value in token_times_posdict.items():
...     print "{} -> {}".format(word, value)
... 
abandon -> 4
clear -> 1
calm -> 2
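
Since the question is tagged python-3.x while the transcripts above are from a Python 2 session, here is the whole pipeline as a single Python 3 sketch. The .get(word, 0) fallback for words missing from pos_dict and the output filename output.txt are my assumptions; the question's title also asks for the result to be written to an output file, which the answer above only prints:

```python
import ast
from collections import Counter

# Recreate the sample inputs from the question
with open('file_test.txt', 'w') as fout:
    fout.write('abandon, abandon, calm, clear')
with open('pos_dict.txt', 'w') as fout:
    fout.write('"abandon":2,"calm":2,"clear":1')

# Read the tokens: strip whitespace, split on comma-space
with open('file_test.txt') as fin:
    tokens = fin.read().strip().split(', ')

# ast.literal_eval is a safer stand-in for eval() here
with open('pos_dict.txt') as fin:
    pos_dict = ast.literal_eval('{' + fin.read().strip() + '}')

# Multiply each token count by its pos_dict weight;
# .get(word, 0) guards against words absent from pos_dict
token_times_posdict = {word: freq * pos_dict.get(word, 0)
                       for word, freq in Counter(tokens).items()}

# Write the result to an output file, one "word -> value" per line
with open('output.txt', 'w') as fout:
    for word, value in token_times_posdict.items():
        fout.write('{} -> {}\n'.format(word, value))
```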