I need to list the number of documents in which each word appears.
Example:
data = ["This is my pen","That is his pen","This is not my pen"]
Desired output:
{'This':2,'is': 3,'my': 2,'pen':3}
{'That':1,'is': 3,'his': 1,'pen':3}
{'This':2,'is': 3,'not': 1,'my': 2,'pen':3}
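from collections import Counter

documents = data
for sent in documents:
    for word in sent.split():
        if word in sent:
            # This counts words within the current sentence only, not across documents
            windoc = dict(Counter(sent.split()))
            print(windoc)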
Answer 0 (score: 2)
Assuming a word must be counted at most once per document:
import collections
data = ["This is my pen my pen my pen","That is his pen","This is not my pen"]
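# Generator of per-document word sets; it can be consumed only once.
deduped = (set(d.split()) for d in data)
freq = collections.Counter(w for d in deduped for w in d)
# deduped is exhausted at this point, so split each document again.
result = [{w: freq[w] for w in set(d.split())} for d in data]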
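You need to deduplicate the words in each document first (see deduped above). I made deduped a generator to avoid building an intermediate list of sets, but an intermediate set of words is still created for each document.
Alternatively, you can implement your own counter. That is usually not a good idea, but if memory consumption is critical and you want to avoid creating the intermediate sets while iterating through the deduped generator, you may need to write your own.
Either way, time and memory complexity are linear.
Output: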
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
{'That': 1, 'his': 1, 'is': 3, 'pen': 3},
{'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
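The hand-rolled counter mentioned above is not spelled out in the answer, so here is one possible sketch. It remembers, for each word, the index of the last document that counted it, which avoids building a set per document. The names document_frequencies and last_seen are made up for illustration, and whether this really saves memory depends on the data, since it keeps a second dictionary the size of the vocabulary.

```python
def document_frequencies(docs):
    """Count, for each word, how many documents contain it at least once."""
    freq = {}       # word -> number of documents containing the word
    last_seen = {}  # word -> index of the last document that counted it
    for i, doc in enumerate(docs):
        for word in doc.split():
            # Count a word only the first time it shows up in document i.
            if last_seen.get(word) != i:
                last_seen[word] = i
                freq[word] = freq.get(word, 0) + 1
    return freq


data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
freq = document_frequencies(data)
# Repeated words within a document collapse naturally as dict keys.
result = [{w: freq[w] for w in d.split()} for d in data]
print(result)
```

For the sample data this should give the same counts as the Counter version; only the key order inside each dictionary may differ.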
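Answer 1 (score: 1)
You can build a dictionary of word frequencies from all the available sentences, then construct the desired output from it. Here is a working example:
Given the input documents: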
In [1]: documents
Out[1]: ['This is my pen', 'That is his pen', 'This is not my pen']
Build the word frequency dictionary:
In [2]: d = {}
   ...: for sent in documents:
   ...:     for word in set(sent.split()):
   ...:         d[word] = d.get(word, 0) + 1
   ...:
Then construct the desired output:
In [3]: result = []
   ...: for sent in documents:
   ...:     result.append({word: d[word] for word in sent.split()})
   ...:
In [4]: result
Out[4]:
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
{'That': 1, 'his': 1, 'is': 3, 'pen': 3},
{'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
So, overall, the code looks like this:
documents = ['This is my pen', 'That is his pen', 'This is not my pen']
d = {}
# construct the word frequencies dictionary
for sent in documents:
    # set() ensures each word is counted once per document
    for word in set(sent.split()):
        d[word] = d.get(word, 0) + 1
# format the output in the desired format
result = []
for sent in documents:
    result.append({word: d[word] for word in sent.split()})
Answer 2 (score: 1)
from collections import Counter

data = ["This is my pen is is","That is his pen pen pen pen","This is not my pen"]
# Collect each word once per document
d = []
for s in data:
    for word in set(s.split()):
        d.append(word)
# Number of documents each word appears in
wordCount = Counter(d)
for item in data:
    result = {}
    for word in item.split():
        result[word] = wordCount[word]
    print(result)
Output:
{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}