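Find word frequency as the number of documents in which each word occurs

Time: 2019-07-17 11:41:49

Tags: python python-3.x

I need to list, for each word, the number of documents in which it occurs.

Example: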

data = ["This is my pen","That is his pen","This is not my pen"]

Desired output:

{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}

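from collections import Counter

documents = data
for sent in documents:
    for word in sent.split():
        if word in sent:
            # Counts words within this one sentence only,
            # not across all documents.
            windoc = dict(Counter(sent.split()))
            print(windoc)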

3 answers:

Answer 0: (score: 2)

Assuming a word must be counted at most once per document:

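import collections

data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
# One set of unique words per document
deduped = [set(d.split()) for d in data]
# Number of documents in which each word occurs
freq = collections.Counter(w for d in deduped for w in d)
result = [{w: freq[w] for w in d} for d in deduped]

You need to deduplicate the words in each document first (see deduped above). deduped is a list rather than a generator because it is iterated twice, once to count and once to build the result; a generator would be exhausted after the first pass and result would come out empty. Either way, an intermediate set of words is created for each document.

Alternatively, you can implement your own counter. Doing so is usually not a good idea, but if memory consumption is critical and you want to avoid creating the intermediate sets held in deduped, it may be worth it.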

Either way, time and memory complexity are linear in the size of the input.

Output:

[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
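
To illustrate the "implement your own counter" idea mentioned above, here is a minimal sketch that counts each word at most once per document without building a set per document. The function name document_frequency and the last_seen dictionary are hypothetical names chosen for this example, not part of the answer. Note the trade-off: it replaces the per-document sets with one extra dictionary whose size is the vocabulary.

```python
def document_frequency(docs):
    """Count, for each word, how many documents contain it."""
    freq = {}       # word -> number of documents containing it
    last_seen = {}  # word -> index of the last document where it was counted
    for i, doc in enumerate(docs):
        for word in doc.split():
            # Count the word only the first time it shows up in document i.
            if last_seen.get(word) != i:
                last_seen[word] = i
                freq[word] = freq.get(word, 0) + 1
    return freq


data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
freq = document_frequency(data)
# Repeated words in a document simply overwrite the same key.
result = [{w: freq[w] for w in d.split()} for d in data]
print(result)
```

The dictionaries in result hold the same counts as the output shown above; only the key order may differ, since here it follows the word order in each sentence.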

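Answer 1: (score: 1)

You can build a dictionary that holds the word frequencies across all the sentences, then construct the desired output from it. Here is a working example:

Given the input documents: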

In [1]: documents 
Out[1]: ['This is my pen', 'That is his pen', 'This is not my pen']

Build the word frequency dictionary:

In [2]: d = {}
    ...: for sent in documents:
    ...:     for word in set(sent.split()):    
    ...:         d[word] = d.get(word, 0) + 1
    ...: 

Then construct the desired output:

In [3]: result = []
    ...: for sent in documents:
    ...:     result.append({word: d[word] for word in sent.split()})
    ...:     

In [4]: result 
Out[4]: 
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]

So, overall, the code looks like this:

documents = ['This is my pen', 'That is his pen', 'This is not my pen']
d = {}
# construct the words frequencies dictionary
for sent in documents:
    for word in set(sent.split()):    
        d[word] = d.get(word, 0) + 1

# format the output in the desired format
result = []
for sent in documents:
    result.append({word: d[word] for word in sent.split()})
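
The set() call in the first loop is what makes this a per-document count. As a small sketch of why it matters, the snippet below runs the same loop with and without deduplication on an input where a word repeats inside one sentence. The helper count and its dedupe flag are hypothetical, introduced only for this comparison.

```python
documents = ["This is my pen my pen", "That is his pen", "This is not my pen"]


def count(documents, dedupe):
    """Build the frequency dictionary, optionally deduplicating per sentence."""
    d = {}
    for sent in documents:
        words = sent.split()
        for word in (set(words) if dedupe else words):
            d[word] = d.get(word, 0) + 1
    return d


with_set = count(documents, dedupe=True)
without_set = count(documents, dedupe=False)

print(with_set["pen"])     # 3: "pen" appears in three documents
print(without_set["pen"])  # 4: the repeated "pen" in the first sentence is counted twice
```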

Answer 2: (score: 1)

from collections import Counter

data = ["This is my pen is is","That is his pen pen pen pen","This is not my pen"]

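# Collect each sentence's unique words, so a word counts once per sentence
d = []
for s in data:
    for word in set(s.split()):
        d.append(word)

# wordCount[word] is the number of sentences containing word
wordCount = Counter(d)
for item in data:
    result = {}
    for word in item.split():
        result[word] = wordCount[word]
    print(result)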

Output:

{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}
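
As a possible variation on the code above, the intermediate list d can be skipped by feeding each sentence's set of words straight into Counter.update, which adds one to the count of every element of the iterable it receives. This is only a sketch; the names word_count and results are chosen for this example.

```python
from collections import Counter

data = ["This is my pen is is", "That is his pen pen pen pen", "This is not my pen"]

word_count = Counter()
for s in data:
    # Each unique word in the sentence adds 1 to its document count.
    word_count.update(set(s.split()))

# One dictionary per sentence, keyed by the words of that sentence.
results = [{word: word_count[word] for word in s.split()} for s in data]
for r in results:
    print(r)
```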