def getWordFreq(corpus):
    wordFreq = []
    for sent in corpus:
        for word in sent:
            wordFreq.append((word, sent.count(word)))
    return wordFreq
I wrote this function to get the frequency of each word in a corpus.
To test it, I wrote:
cc = [ ['hi','ho'], ['hee','ho']]
getWordFreq(cc)
But this returned
[('hi', 1), ('ho', 1), ('hee', 1), ('ho', 1)]
instead of a single ('ho', 2).
What am I missing?
Answer 0: (score: 1)
You can try this solution:
from collections import Counter
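def getWordFreq(corpus):
    # Flatten the list of sentences into one list of words.
    wordFreq = [j for i in corpus for j in i]
    # Counter tallies each word; items() yields (word, count) pairs.
    return list(Counter(wordFreq).items())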
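A possible variation on the same idea, sketched under the assumption that the corpus is a list of token lists as in the question: itertools.chain.from_iterable flattens the sentences lazily, and Counter.most_common() returns the pairs ordered from most to least frequent. The function name word_freq_sorted is made up for this illustration.

```python
from collections import Counter
from itertools import chain


def word_freq_sorted(corpus):
    # Hypothetical helper name; flattens the sentences and counts every token.
    counts = Counter(chain.from_iterable(corpus))
    # most_common() orders the pairs from most to least frequent.
    return counts.most_common()


cc = [['hi', 'ho'], ['hee', 'ho']]
print(word_freq_sorted(cc))  # [('ho', 2), ('hi', 1), ('hee', 1)]
```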
Answer 1: (score: 1)
Hopefully this simplest version helps. Here we use plain for loops.
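def getWordFreq(corpus):
    result = {}
    for data in corpus:
        for word in data:
            if word in result:
                result[word] += 1  # word seen before: increment its count
            else:
                result[word] = 1   # first occurrence of the word
    return list(result.items())  # list of (word, count) pairs

cc = [['hi', 'ho'], ['hee', 'ho']]
print(getWordFreq(cc))
Output (the order of the pairs depends on the Python version; 3.7+ lists them in insertion order): [('hee', 1), ('hi', 1), ('ho', 2)]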
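If a deterministic order is wanted, one option is to sort the pairs before returning them. The sketch below also folds the if/else into dict.get, which returns a default for a missing key. The name get_word_freq is a hypothetical variant of the function above, not part of the original answer.

```python
def get_word_freq(corpus):
    # Hypothetical snake_case variant of the function above.
    result = {}
    for data in corpus:
        for word in data:
            # get() returns 0 for a word that has not been seen yet.
            result[word] = result.get(word, 0) + 1
    # Sorting the (word, count) pairs gives the same order on every run.
    return sorted(result.items())


print(get_word_freq([['hi', 'ho'], ['hee', 'ho']]))
# [('hee', 1), ('hi', 1), ('ho', 2)]
```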
Answer 2: (score: 0)
You are better off using a dictionary for this task:
def getWordFrequency(corpus):
    frequencies = {}
    for sentence in corpus:
        for word in sentence:
            if word in frequencies:
                frequencies[word] += 1
            else:
                frequencies[word] = 1
    return frequencies
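A dictionary keeps a mapping from keys (the words) to values (their frequencies). That makes tracking frequencies easier and faster, because you do not have to merge duplicate words yourself.
Your implementation only appends a tuple of each word and its count within the current sentence. It never groups identical words together, so the counts are not accumulated across the corpus.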
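The difference can be seen directly on the test corpus from the question. This is only an illustrative sketch: sent.count(word) looks at a single sentence, so 'ho' is counted once in each sentence and the two counts are never added together.

```python
cc = [['hi', 'ho'], ['hee', 'ho']]

# list.count only inspects the list it is called on, i.e. one sentence.
per_sentence = [sent.count('ho') for sent in cc]
print(per_sentence)       # [1, 1]

# Summing the per-sentence counts gives the corpus-wide frequency.
print(sum(per_sentence))  # 2
```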
Python's collections module also provides Counter for exactly this kind of task.
from collections import Counter
def getWordFrequency(corpus):
    freq = Counter()
    for sentence in corpus:
        for word in sentence:
            freq[word] += 1
    return freq
Note that we do not have to check whether the word already exists in the counter, because Counter handles that for us.
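As a small sketch of that behaviour: a missing key in a Counter reads as 0 instead of raising KeyError, and Counter.update accepts an iterable and adds one to the count of each element, so the inner loop can be dropped. The function name count_words is an assumption made for this example.

```python
from collections import Counter


def count_words(corpus):
    # Hypothetical name; gives the same counts as the loop-based version.
    freq = Counter()
    for sentence in corpus:
        # update() increments the count of every word in the sentence.
        freq.update(sentence)
    return freq


freq = count_words([['hi', 'ho'], ['hee', 'ho']])
print(freq['ho'])       # 2
print(freq['missing'])  # 0, no KeyError for unseen words
```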
Answer 3: (score: 0)
As I mentioned in the comments, you are counting each word within sent rather than across the whole corpus.
Here is what you need to do:
def getWordFreq(corpus):
    wordFreq = []
    for sent in corpus:
        for word in sent:
            # Sum the word's count over every sentence in the corpus.
            wordFreq.append((word, sum(map(lambda x: x.count(word), corpus))))
    return wordFreq
cc = [ ['hi','ho'], ['hee','ho']]
getWordFreq(cc)
gives
[('hi', 1), ('ho', 2), ('hee', 1), ('ho', 2)]
If you want each word to appear only once, change wordFreq to a set and use add instead of append:
def getWordFreq(corpus):
    wordFreq = set()
    for sent in corpus:
        for word in sent:
            # Identical (word, count) tuples collapse into one set element.
            wordFreq.add((word, sum(map(lambda x: x.count(word), corpus))))
    return wordFreq
cc = [ ['hi','ho'], ['hee','ho']]
getWordFreq(cc)
gives
{('hee', 1), ('hi', 1), ('ho', 2)}
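Note that both versions rescan the whole corpus for every word, so the work grows roughly with the square of the corpus size. A possible alternative, offered only as a sketch, counts once with Counter and then builds the same set of pairs. The name get_word_freq_set is made up here.

```python
from collections import Counter


def get_word_freq_set(corpus):
    # Hypothetical name; one pass over the corpus instead of a rescan per word.
    counts = Counter(word for sent in corpus for word in sent)
    return set(counts.items())


cc = [['hi', 'ho'], ['hee', 'ho']]
print(get_word_freq_set(cc) == {('hee', 1), ('hi', 1), ('ho', 2)})  # True
```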