Getting word frequencies from a list of sentences, but the counts are not merged (Python)

Time: 2017-09-27 05:56:20

Tags: python

def getWordFreq(corpus):

    wordFreq = []
    for sent in corpus:
        for word in sent:
            wordFreq.append((word, sent.count(word)))
    return wordFreq

I wrote this function to get the frequency of each word in a corpus.

To test it, I wrote

cc = [ ['hi','ho'], ['hee','ho']]
getWordFreq(cc)

But this returned

[('hi', 1), ('ho', 1), ('hee', 1), ('ho', 1)]

instead of ('ho', 2).

What am I missing?

4 answers:

Answer 0: (score: 1)

You can try this solution:

from collections import Counter
def getWordFreq(corpus):
    wordFreq = [j for i in corpus for j in i]
    return list(Counter(wordFreq).items())
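
A possible variant of the same idea, sketched under the assumption that every sentence is a list of word strings: itertools.chain.from_iterable flattens the corpus lazily instead of building an intermediate list, and Counter counts the result. The name get_word_freq is hypothetical and not part of the answer.

```python
from collections import Counter
from itertools import chain


def get_word_freq(corpus):
    # Flatten the sentences into one stream of words, then count them.
    return list(Counter(chain.from_iterable(corpus)).items())


cc = [['hi', 'ho'], ['hee', 'ho']]
print(get_word_freq(cc))
```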

Answer 1: (score: 1)

Hopefully this simple version helps. Here we use a plain for loop.


def getWordFreq(corpus):
    result = {}
    for data in corpus:
        for word in data:
            if word in result:
                result[word] += 1  # word already seen: increment its count
            else:
                result[word] = 1

    return list(result.items())  # list of (word, count) pairs

cc = [['hi', 'ho'], ['hee', 'ho']]
print(getWordFreq(cc))

Output (Python 3.7+, where dicts keep insertion order): [('hi', 1), ('ho', 2), ('hee', 1)]

Answer 2: (score: 0)

You are better off using a dictionary for this task:

def getWordFrequency(corpus):
    frequencies = {}
    for sentence in corpus:
        for word in sentence:
            if word in frequencies:
                frequencies[word] += 1
            else:
                frequencies[word] = 1
    return frequencies

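A dictionary keeps a mapping from keys (words) to values (their frequencies). That makes tracking frequencies easier and faster, because you do not have to merge the words yourself.

Your implementation only appends a tuple of the word and its frequency within the current sentence. It never groups identical words together or keeps a running count across the whole corpus.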
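
To illustrate the difference, here is a small sketch using the test corpus from the question. The variable names per_sentence and corpus_wide are made up for this example.

```python
cc = [['hi', 'ho'], ['hee', 'ho']]

# list.count only sees the words of the one sentence it is called on.
per_sentence = [sent.count('ho') for sent in cc]
print(per_sentence)  # [1, 1]

# Adding up the per-sentence counts gives the corpus-wide frequency.
corpus_wide = sum(per_sentence)
print(corpus_wide)  # 2
```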

Python's collections module also provides Counter for exactly this kind of task:

from collections import Counter
def getWordFrequency(corpus):
    freq = Counter()
    for sentence in corpus:
        for word in sentence:
            freq[word] += 1
    return freq

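Note that we do not have to check whether the word already exists in the counter, because Counter handles that for us: a missing key counts as zero.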
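
A short sketch of that behavior, and of Counter.update, which can take a whole sentence at once so the inner loop is not needed. The corpus here is just the test data from the question.

```python
from collections import Counter

freq = Counter()
print(freq['ho'])  # 0: a missing key reads as zero, no KeyError

# update() adds the counts of every word in an iterable in one call.
for sentence in [['hi', 'ho'], ['hee', 'ho']]:
    freq.update(sentence)

print(freq['ho'])  # 2
```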

Answer 3: (score: 0)

As I mentioned in the comments, you are counting each word within sent rather than across the whole corpus. Here is what you need to do:

def getWordFreq(corpus):
    wordFreq = []
    for sent in corpus:
        for word in sent:
            wordFreq.append((word, sum(map(lambda x: x.count(word), corpus))))
    return wordFreq


cc = [ ['hi','ho'], ['hee','ho']]
getWordFreq(cc)

gives

[('hi', 1), ('ho', 2), ('hee', 1), ('ho', 2)]

If you want each word to appear only once, change wordFreq to a set and use add instead of append:

def getWordFreq(corpus):

    wordFreq = set()
    for sent in corpus:
        for word in sent:
            wordFreq.add((word, sum(map(lambda x: x.count(word), corpus))))
    return wordFreq

cc = [ ['hi','ho'], ['hee','ho']]
getWordFreq(cc)

gives

{('hee', 1), ('hi', 1), ('ho', 2)}
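
A set has no defined order, so the pairs may print in a different order than shown. If a stable order matters, one option is to sort the result. This is a minimal sketch of the same counting approach; the name get_word_freq_set is hypothetical.

```python
def get_word_freq_set(corpus):
    # Count each word across every sentence; the set drops duplicate pairs.
    return {(word, sum(sent.count(word) for sent in corpus))
            for sent in corpus for word in sent}


cc = [['hi', 'ho'], ['hee', 'ho']]
print(sorted(get_word_freq_set(cc)))  # [('hee', 1), ('hi', 1), ('ho', 2)]
```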