From file_test.txt I need to count how many times each word occurs in the file, using the nltk.FreqDist() function. While counting the word frequencies, I need to check whether each word is in pos_dict.txt, and if it is, multiply that word's frequency by the number stored for the same word in pos_dict.txt.
file_test.txt looks like this:
abandon, abandon, calm, clear
For those words, pos_dict.txt looks like this:
"abandon":2,"calm":2,"clear":1,...
My code is:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import nltk
f_input_pos=open('file_test.txt','r').read()
def features_pos(dat):
    tokens = nltk.word_tokenize(dat)
    fdist = nltk.FreqDist(tokens)
    f_pos_dict = open('pos_dict.txt','r').read()
    f = f_pos_dict.split(',')
    for part in f:
        b = part.split(':')
        c = b[-1]  # to catch the number
        T2 = eval(str(c).replace("'",""))  # convert number from string to int
        for word in fdist:
            if word in f_pos_dict:
                d = fdist[word]
                print(word,'->',d*T2)
features_pos(f_input_pos)
So my output should look like this:
abandon->4
calm->2
clear->1
But my output repeats everything, and the multiplications are obviously wrong. I'm a bit stuck and I can't find where the error is; maybe I'm using the for loops incorrectly. I'd be grateful if anyone can help :)
Answer (score: 1)
Firstly, you can quickly read your pos_dict.txt file by treating its contents as the string representation of a dictionary:
alvas@ubi:~$ echo '"abandon":2,"calm":2,"clear":1' > pos_dict.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> with io.open('pos_dict.txt', 'r') as fin:
... pos_dict = eval("{" + fin.read() + "}")
...
>>>
>>> pos_dict['abandon']
2
>>> pos_dict['clear']
1
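A caveat on the approach above: eval() will execute whatever code the file happens to contain, so if pos_dict.txt could ever come from an untrusted source, ast.literal_eval is a safer drop-in. A minimal Python 3 sketch, assuming the same one-line "key":value format shown in the question:

```python
import ast

# A line in the same format as pos_dict.txt in the question.
line = '"abandon":2,"calm":2,"clear":1'

# Wrap in braces and parse as a dict literal; ast.literal_eval only
# accepts Python literals, so arbitrary code in the file cannot run.
pos_dict = ast.literal_eval("{" + line + "}")
print(pos_dict["abandon"])  # 2
```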
Next, to read your file_test.txt, we have to read the file, strip leading/trailing whitespace, and then split the words on ', ' (comma followed by a space). Then, using a collections.Counter object, we can easily get the token counts (see also Difference between Python's collections.Counter and nltk.probability.FreqDist):
alvas@ubi:~$ echo 'abandon, abandon, calm, clear' > file_test.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
... tokens = fin.read().strip().split(', ')
...
>>> Counter(tokens)
Counter({u'abandon': 2, u'clear': 1, u'calm': 1})
To access the token counts from file_test.txt and multiply them by the values from pos_dict.txt, we iterate through the Counter object using the .items() function (just like how we access a dictionary's key-value pairs):
>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
... tokens = fin.read().strip().split(', ')
...
>>>
>>> word_counts = Counter(tokens)
>>> with io.open('pos_dict.txt', 'r') as fin:
... pos_dict = eval("{" + fin.read() + "}")
...
>>>
>>> token_times_posdict = {word:freq*pos_dict[word] for word, freq in word_counts.items()}
>>> token_times_posdict
{u'abandon': 4, u'clear': 1, u'calm': 2}
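Note that the comprehension above raises a KeyError for any token that does not appear in pos_dict. If your real data can contain such words, a defensive variant uses dict.get with a fallback weight (the default of 1 below is an assumption; use 0 if unscored words should contribute nothing):

```python
from collections import Counter

# "unknown" is a made-up token to illustrate a word missing from pos_dict.
word_counts = Counter(["abandon", "abandon", "calm", "clear", "unknown"])
pos_dict = {"abandon": 2, "calm": 2, "clear": 1}

# Fall back to a weight of 1 for words absent from pos_dict,
# instead of raising KeyError.
token_times_posdict = {word: freq * pos_dict.get(word, 1)
                       for word, freq in word_counts.items()}
print(token_times_posdict)
```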
And then to print it out:
>>> for word, value in token_times_posdict.items():
... print "{} -> {}".format(word, value)
...
abandon -> 4
clear -> 1
calm -> 2
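Putting the pieces together, a complete Python 3 version of the whole pipeline might look like the sketch below. The file names come from the question, the files are assumed to have exactly the formats shown above, and words absent from pos_dict.txt are simply skipped:

```python
import ast
from collections import Counter

def features_pos(tokens_path, pos_dict_path):
    """Count tokens in tokens_path, weighting each count by pos_dict_path."""
    with open(tokens_path) as fin:
        # 'abandon, abandon, calm, clear' -> ['abandon', 'abandon', ...]
        tokens = fin.read().strip().split(', ')
    with open(pos_dict_path) as fin:
        # '"abandon":2,...' -> {'abandon': 2, ...}, parsed safely
        pos_dict = ast.literal_eval("{" + fin.read().strip() + "}")
    # Multiply each token's frequency by its weight; skip unknown words.
    return {word: freq * pos_dict[word]
            for word, freq in Counter(tokens).items()
            if word in pos_dict}

# Demo data taken from the question.
with open('file_test.txt', 'w') as fout:
    fout.write('abandon, abandon, calm, clear')
with open('pos_dict.txt', 'w') as fout:
    fout.write('"abandon":2,"calm":2,"clear":1')

for word, value in sorted(features_pos('file_test.txt', 'pos_dict.txt').items()):
    print("{} -> {}".format(word, value))
# abandon -> 4
# calm -> 2
# clear -> 1
```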