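I am currently trying to solve this homework problem.
My task is to implement a function that returns a vector of word counts for a given text. I need to split the text into words, using NLTK's tokeniser to tokenise each sentence.
This is the code I have so far: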
import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
def word_counts(text, words):
    """Return a vector that represents the counts of specific words in the text
    >>> word_counts("Here is sentence one. Here is sentence two.", ['Here', 'two', 'three'])
    [2, 1, 0]
    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> word_counts(emma, ['the', 'a'])
    [4842, 3001]
    """
    from nltk.tokenize import TweetTokenizer
    text = nltk.sent_tokenize(text)
    words = nltk.sent_tokenize(words)
    wordList = []
    for sen in text, words:
        for word in nltk.word_tokenize(sen):
            wordList.append(text, words).split(word)
    counter = TweetTokenizer(wordList)
    return counter
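There are two doctests, which should give the results [2, 1, 0] and [4842, 3001].
I have spent the whole day trying to solve this. I feel I am close, but I don't know what I am doing wrong; the script gives me an error every time.
Any help would be much appreciated. Thank you.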
Answer 0 (score: 2)
import nltk
# Uncomment on the first run to fetch the tokenizer model and corpora:
# nltk.download('punkt')
# nltk.download('gutenberg')
# nltk.download('brown')
def word_counts(text, words):
    """Return a vector that represents the counts of specific words in the text
    word_counts("Here is sentence one. Here is sentence two.", ['Here', 'two', 'three'])
    [2, 1, 0]
    emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    word_counts(emma, ['the', 'a'])
    [4842, 3001]
    """
    textTok = nltk.word_tokenize(text)  # split the raw text into word tokens
    counts = nltk.FreqDist(textTok)  # this counts all word occurrences
    return [counts[x] for x in words]  # FreqDist gives 0 for tokens it never saw
r1 = word_counts("Here is sentence one. Here is sentence two.", ['Here', 'two', 'three'])
print(r1) # [2, 1, 0]
emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
r2 = word_counts(emma, ['the', 'a'])
print(r2) # [4842, 3001]
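For comparison, here is a rough sketch of the same tokenise-then-count idea using only the standard library, which may help if you want to test the counting logic without downloading any NLTK data. The function name word_counts_stdlib and the regular expression are my own assumptions, not part of NLTK. The regex is only a crude stand-in for nltk.word_tokenize, so on a real corpus such as Emma the counts will probably not match the doctest values exactly.

```python
import re
from collections import Counter


def word_counts_stdlib(text, words):
    """Count how often each entry of `words` occurs in `text` (case-sensitive).

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    # Assumed tokenisation rule: a token is a run of letters, digits, or apostrophes.
    # nltk.word_tokenize handles punctuation and contractions differently.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    counts = Counter(tokens)  # like FreqDist, Counter returns 0 for unseen keys
    return [counts[w] for w in words]


result = word_counts_stdlib(
    "Here is sentence one. Here is sentence two.",
    ['Here', 'two', 'three'],
)
print(result)  # [2, 1, 0]
```

nltk.FreqDist is a subclass of collections.Counter, which is why looking up a word that never occurs gives 0 rather than raising a KeyError in both versions.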