Count word frequencies and write them to an output file

Posted: 2016-05-23 09:48:15

Tags: python-3.x nltk

From file_test.txt I need to count, using the nltk.FreqDist() function, how many times each word appears in the file. While counting the word frequencies, I also need to check whether each word is in pos_dict.txt; if it is, I multiply the word's frequency by the number associated with that same word in pos_dict.txt.

file_test.txt looks like this:

  abandon, abandon, calm, clear
and so on for these words.

pos_dict.txt looks like this:

"abandon":2,"calm":2,"clear":1,...

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import nltk

f_input_pos=open('file_test.txt','r').read()

def features_pos(dat):
    tokens = nltk.word_tokenize(dat)
    fdist=nltk.FreqDist(tokens)

    f_pos_dict=open('pos_dict.txt','r').read()
    f=f_pos_dict.split(',') 

    for part in f:
        b=part.split(':')
        c=b[-1]   #to catch the number
        T2 = eval(str(c).replace("'","")) # convert number from string to int

        for word in fdist:
            if word in f_pos_dict:
               d=fdist[word]
               print(word,'->',d*T2)


features_pos(f_input_pos)

So my output should look like this:

abandon->4
calm->2
clear->1

But my output repeats everything and the multiplications are clearly wrong. I'm a bit stuck and can't see where the mistake is; I'm probably using the for loops incorrectly. I'd be grateful if someone could help :)

1 answer:

Answer 0 (score: 1)

First, you can read the pos_dict.txt file quickly by treating its contents as the string representation of a dictionary:

alvas@ubi:~$ echo '"abandon":2,"calm":2,"clear":1' > pos_dict.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> pos_dict['abandon']
2
>>> pos_dict['clear']
1
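
As a side note, eval() will execute arbitrary code found in the file. A safer alternative (my suggestion, not part of the original answer) is ast.literal_eval, which only accepts Python literals:

```python
import ast
import io

# Recreate the sample pos_dict.txt from the question
with io.open('pos_dict.txt', 'w') as fout:
    fout.write(u'"abandon":2,"calm":2,"clear":1')

# literal_eval raises ValueError on anything that is not a plain
# literal, so a malicious pos_dict.txt cannot execute code
with io.open('pos_dict.txt', 'r') as fin:
    pos_dict = ast.literal_eval('{' + fin.read().strip() + '}')
```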

Next, to read your file_test.txt, we have to read the file, strip the leading and trailing whitespace, and then split the words on ', ' (a comma followed by a space).

Then, using a collections.Counter object, we can easily get the token counts (see also Difference between Python's collections.Counter and nltk.probability.FreqDist):

alvas@ubi:~$ echo 'abandon, abandon, calm, clear' > file_test.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> Counter(tokens)
Counter({u'abandon': 2, u'clear': 1, u'calm': 1})

To access the token counts from file_test.txt and multiply them by the values in pos_dict.txt, we iterate over the Counter object with the .items() function (just like how we access a dictionary's key-value pairs):

>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> 
>>> word_counts = Counter(tokens)
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> token_times_posdict = {word:freq*pos_dict[word] for word, freq in Counter(tokens).items()}
>>> token_times_posdict
{u'abandon': 4, u'clear': 1, u'calm': 2}

Then to print it out:

>>> for word, value in token_times_posdict.items():
...     print "{} -> {}".format(word, value)
... 
abandon -> 4
clear -> 1
calm -> 2
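
Since the question is tagged python-3.x while the transcripts above are from a Python 2 session, here is the whole pipeline as a single Python 3 sketch. The .get(word, 0) fallback for words missing from pos_dict and the output filename output.txt are my assumptions; the question's title also asks for the result to be written to an output file, which the answer above only prints:

```python
import ast
from collections import Counter

# Recreate the sample inputs from the question
with open('file_test.txt', 'w') as fout:
    fout.write('abandon, abandon, calm, clear')
with open('pos_dict.txt', 'w') as fout:
    fout.write('"abandon":2,"calm":2,"clear":1')

# Read the tokens: strip whitespace, split on comma-space
with open('file_test.txt') as fin:
    tokens = fin.read().strip().split(', ')

# ast.literal_eval is a safer stand-in for eval() here
with open('pos_dict.txt') as fin:
    pos_dict = ast.literal_eval('{' + fin.read().strip() + '}')

# Multiply each token count by its pos_dict weight;
# .get(word, 0) guards against words absent from pos_dict
token_times_posdict = {word: freq * pos_dict.get(word, 0)
                       for word, freq in Counter(tokens).items()}

# Write the result to an output file, one "word -> value" per line
with open('output.txt', 'w') as fout:
    for word, value in token_times_posdict.items():
        fout.write('{} -> {}\n'.format(word, value))
```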