Counting words in a text file

Date: 2013-05-28 08:59:00

Tags: python nltk

I have a .txt file (example):

  

  A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations.

How can I count how many times "professional" occurs? (Is using NLTK the best option for this?)

text_file = open("text.txt", "r+b")

5 answers:

Answer 0 (score: 4):

I have changed my answer to better reflect what you want:

from nltk import word_tokenize

with open('file_path') as f:
    content = f.read()
# we will use your text example instead:
content = "A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations."

def Count_Word(word, data):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        # this plural check is dangerous, if trying to find a word that ends with an 's'
        token = token[:-1] if token[-1] == 's' else token
        if token == word:
            c += 1
    return c

print Count_Word('professional', content)
# Output: 3

Here is a modified version of the method:

def Count_Word(word, data, leading=[], trailing=["'s", "s"]):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        # trim any matching leading fragments before comparing
        for lead in leading:
            if token.startswith(lead):
                token = token.partition(lead)[2]
        # trim any matching trailing fragments (e.g. possessive 's, plural s)
        for trail in trailing:
            if token.endswith(trail):
                token = token.rpartition(trail)[0]
        if token == word:
            c += 1
    return c

I've added optional parameters, which are lists of leading or trailing fragments to trim off a word so that it will still be found... at the moment I've only put in the defaults 's and s, but if you find others that work for you, you can always add them. If the lists start getting long, you can make them constants.
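For example, a quick usage sketch (the extra "ism" fragment here is my illustration, not part of the original answer):

# hypothetical: also fold "professionalism" into the count
print Count_Word('professional', content, trailing=["'s", "s", "ism"])
# Output on the sample text: 3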

Answer 1 (score: 4):

It can be solved in one line (plus the import):

>>> from collections import Counter
>>> Counter(w.lower() for w in open("text.txt").read().split())['professional']
2
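Note that this counts only tokens that are exactly "professional", so "professionals" is missed (hence 2 rather than 3) and punctuation stays glued to neighboring words. A minimal variant of mine (not from the original answer) that strips punctuation and a trailing plural "s", about as crudely as the other answers do:

>>> import string
>>> Counter(w.lower().strip(string.punctuation).rstrip('s')
...         for w in open("text.txt").read().split())['professional']
3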

Answer 2 (score: 3):

You could simply tokenize the string and then search through all the tokens... but that is just one way to do it. There are many others...

import nltk

# text_file is the file object opened in the question
s = text_file.read()
tokens = nltk.word_tokenize(s)
counter = 0
for token in tokens:
  toke = token
  # crude plural handling: drop a trailing "s"
  if token[-1] == "s":
    toke = token[0:-1]
  if toke.lower() == "professional":
    counter += 1

print counter
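As one of those "many other" ways, here is a regex sketch of mine (not in the original answer) that matches "professional" with an optional plural "s" on word boundaries:

import re

# \b anchors at word boundaries; "s?" makes the plural optional
print len(re.findall(r'\bprofessionals?\b', s, re.IGNORECASE))
# Output on the sample text: 3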

Answer 3 (score: 1):

from collections import Counter

def stem(word):
    # naive stemmer: drop a trailing "s", then lowercase
    if word[-1] == 's':
        word = word[:-1]
    return word.lower()

# filename is the path to your .txt file
print Counter(map(stem, open(filename).read().split()))
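To pull out just the one count instead of printing the whole table, you can index the Counter (a small addition of mine):

counts = Counter(map(stem, open(filename).read().split()))
print counts['professional']
# Output on the sample text: 3 ("professional" twice plus "professionals" once)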

Answer 4 (score: 1):

The answer to your question depends on exactly what you want to count and how much effort you want to put into normalization. I see at least three approaches, depending on your goal.

In the code below, I define three functions that return a dictionary of counts for all the words occurring in an input text.

import nltk
from collections import defaultdict

text = "This is my sample text."

lower = text.lower()

tokenized = nltk.word_tokenize(lower)

ps = nltk.stem.PorterStemmer()
wnlem = nltk.stem.WordNetLemmatizer()

# The Porter stemming algorithm tries to remove all suffixes from a word.
# There are better stemming algorithms out there, some of which may be in NLTK.
def StemCount(token_list):
    countdict = defaultdict(int)
    for token in token_list:
        stem = ps.stem(token)
        countdict[stem] += 1
    return countdict

# Lemmatizing is a little less brutal than stemming--it doesn't try to relate
#   words across parts of speech so much. You do, however, need to part-of-speech
#   tag the text before you can use this approach.
def LemmaCount(token_list):
    # Where mytagger is a part of speech tagger 
    #   you've trained (perhaps per http://nltk.sourceforge.net/doc/en/ch03.html)
    #   using a simple tagset compatible with WordNet (i.e. all nouns become 'n', etc)
    token_pos_tuples = mytagger.tag(token_list)
    countdict = defaultdict(int)
    for token_pos in token_pos_tuples:
        lemma = wnlem.lemmatize(token_pos[0], token_pos[1])
        countdict[lemma] += 1
    return countdict

# Doesn't do anything fancy. Just counts the number of occurrences for each unique
#   string in the input.
def SimpleCount(token_list):
    countdict = defaultdict(int)
    for token in token_list:
        countdict[token] += 1
    return countdict
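As a quick usage sketch of mine (not in the original answer), contrasting SimpleCount with StemCount on a toy token list:

toy = nltk.word_tokenize("a professional hires professionals .")
print SimpleCount(toy)['professional']  # 1: exact string matches only
print StemCount(toy)['profession']      # 2: both forms stem to "profession"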

To illustrate the difference between the PorterStemmer and the WordNetLemmatizer, consider the following:

>>> wnlem.lemmatize('professionals','n')
'professional'
>>> ps.stem('professionals')
'profession'

with wnlem and ps as defined in the snippet above.

Depending on your application, something like SimpleCount(token_list) may work just fine.