Question

I need to list, for each word, the number of documents in which that word occurs.

Example:

data = ["This is my pen","That is his pen","This is not my pen"]

Desired output:

{'This':2,'is': 3,'my': 2,'pen':3}
{'That':1,'is': 3,'his': 1,'pen':3}
{'This':2,'is': 3,'not': 1,'my': 2,'pen':3}

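My attempt so far (it only counts words within each sentence, not across documents):

from collections import Counter

for sent in data:
    for word in sent.split():
        if word in sent:
            # Counts occurrences inside this one sentence only
            windoc = dict(Counter(sent.split()))
            print(windoc)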

Answer 1

Assuming a word must not be counted more than once per document:

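import collections

data = ["This is my pen my pen my pen","That is his pen","This is not my pen"]
# One set of unique words per document (a list, because it is iterated twice)
deduped = [set(d.split()) for d in data]
# Number of documents each word appears in
freq = collections.Counter(w for d in deduped for w in d)
result = [{ w: freq[w] for w in d } for d in deduped ]

You need to deduplicate the words in each document first (see deduped above). deduped is a list rather than a generator because it is iterated twice, once to build the counts and once to build the result; a generator would be exhausted after the first pass and result would come out empty. This keeps one set of words per document in memory.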

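Alternatively, you can implement your own counter. Writing your own counter is usually not a good idea, but if memory consumption is critical and you want to avoid building the intermediate per-document sets, it may be worth doing.

Either way, time and memory complexity are linear in the total number of words.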

Output:

[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
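
The hand-rolled counter mentioned above could look roughly like the sketch below. It is only one possible approach, not something from the original answer: the function name document_frequency and the last_seen dictionary are illustrative assumptions. Instead of building a set per document, it remembers the index of the last document that counted each word.

```python
def document_frequency(docs):
    """Count, for each word, how many documents contain it."""
    freq = {}       # word -> number of documents containing it
    last_seen = {}  # word -> index of the last document that counted it
    for i, doc in enumerate(docs):
        for word in doc.split():
            # Count the word only the first time it shows up in document i
            if last_seen.get(word) != i:
                last_seen[word] = i
                freq[word] = freq.get(word, 0) + 1
    return freq


data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
freq = document_frequency(data)
# Repeated words collapse into one key because dict keys are unique
result = [{w: freq[w] for w in d.split()} for d in data]
print(result)
```

This trades the per-document sets for one extra dictionary over the whole vocabulary, so whether it actually saves memory depends on the data.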

Answer 2

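You can build a dictionary of word frequencies from all the available sentences, then construct the desired output from it. Here is a working example:

Given the input documents: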

In [1]: documents 
Out[1]: ['This is my pen', 'That is his pen', 'This is not my pen']

Build the word-frequency dictionary:

In [2]: d = {}
    ...: for sent in documents:
    ...:     for word in set(sent.split()):    
    ...:         d[word] = d.get(word, 0) + 1
    ...:

Then construct the desired output:

In [3]: result = []
    ...: for sent in documents:
    ...:     result.append({word: d[word] for word in sent.split()})
    ...:     

In [4]: result 
Out[4]: 
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]

So, overall, the code looks like this:

documents = ['This is my pen', 'That is his pen', 'This is not my pen']
d = {}
# construct the words frequencies dictionary
for sent in documents:
    for word in set(sent.split()):    
        d[word] = d.get(word, 0) + 1

# format the output in the desired format
result = []
for sent in documents:
    result.append({word: d[word] for word in sent.split()})
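
The set(sent.split()) call in the first loop is what turns a plain word count into a per-document count. The small sketch below illustrates the difference on made-up input; the helper count_words, its once_per_document flag, and the sample_docs list are hypothetical names introduced only for this comparison.

```python
sample_docs = ["pen pen pen", "my pen"]


def count_words(documents, once_per_document):
    counts = {}
    for sent in documents:
        words = sent.split()
        if once_per_document:
            # Each word contributes at most once per document
            words = set(words)
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return counts


doc_freq = count_words(sample_docs, once_per_document=True)
total_count = count_words(sample_docs, once_per_document=False)
print(doc_freq["pen"])     # documents containing "pen"
print(total_count["pen"])  # total occurrences of "pen"
```

With the set, "pen" is counted in 2 documents; without it, the same loop reports 4 occurrences.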

Answer 3

from collections import Counter

data = ["This is my pen is is","That is his pen pen pen pen","This is not my pen"]

# Collect each word once per document
d = []
for s in data:
    for word in set(s.split()):
        d.append(word)

# Number of documents each word appears in
wordCount = Counter(d)
for item in data:
    result = {}
    for word in item.split():
        result[word] = wordCount[word]
    print(result)

Output:

{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}

Finding word frequency with respect to the documents in which the words occur
