Question

I need to count the words in a sentence. I use

word_matrix[i][j] = sentences[i].count([*words_dict][j])

But a word is also counted when it is contained in another word (for example, "in" is contained in "interactive"). How can I avoid this?

Answer 1

You can use collections.Counter for this:

from collections import Counter
s = 'This is a sentence'

Counter(s.lower().split())

# Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
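
To fill the matrix from the question, one possible sketch is to build one Counter per sentence and look each vocabulary word up in it. The sample `sentences` and `words_dict` below are made up for illustration, and `str.split` only separates on whitespace, so punctuation stays attached to the words.

```python
from collections import Counter

# Hypothetical sample data standing in for the question's variables.
sentences = ['In the lab we work in pairs', 'Interactive tools help']
words_dict = {'in': 0, 'interactive': 1, 'work': 2}

vocab = list(words_dict)  # same order as [*words_dict]
# One Counter per sentence: whole-token counts, case-insensitive.
counters = [Counter(s.lower().split()) for s in sentences]
# A Counter returns 0 for missing keys, so no KeyError handling is needed.
word_matrix = [[c[w] for w in vocab] for c in counters]

print(word_matrix)  # [[2, 0, 1], [0, 1, 0]]
```

Here "in" is not counted inside "interactive", because whole tokens are compared rather than substrings.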

Answer 2

You can do it like this:

sentence = 'this is a test sentence'
word_count = len(sentence.split(' '))

In this case, word_count is 5.
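
One caveat, shown as a small sketch: `split(' ')` splits on every single space, so repeated spaces produce empty strings that inflate the count, while `split()` with no argument splits on runs of whitespace. The sample string is made up.

```python
sentence = 'this  is a test sentence'  # note the double space

count_single_space = len(sentence.split(' '))  # the empty string is counted too
count_whitespace = len(sentence.split())       # runs of whitespace are collapsed

print(count_single_space, count_whitespace)  # 6 5
```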

Answer 3

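Depending on the case, the most efficient solution is to use collections.Counter, but you will miss every word that has a symbol attached:
i.e. in will be different from interactive (as desired), but also different from in:.
An alternative solution that takes this into account could be to count the matches of a RegEx pattern: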

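import re

# re.escape guards against words that contain RegEx metacharacters
my_count = re.findall(r"(?:\s|^)({0})(?=[\s.,;:]|$)".format(re.escape([*words_dict][j])), sentences[i])
print(len(my_count))

What is the RegEx doing?
For a given word, you match:
the same word preceded by a whitespace character or the start of the line, (?:\s|^)
then followed by a whitespace character, a dot, a comma, a semicolon, a colon, or the end of the line, (?=[\s.,;:]|$)
The end-of-line test has to sit outside the square brackets, because inside a character class $ is a literal dollar sign. Using a lookahead (?=...) leaves the trailing separator unconsumed, so two occurrences separated by a single space are both counted.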
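
An alternative sketch, not part of this answer's pattern: the `\b` word-boundary anchor avoids listing the allowed punctuation by hand, and `re.escape` protects against words that contain RegEx metacharacters. The `count_word` helper and the sample sentence are hypothetical.

```python
import re

def count_word(word, sentence):
    """Count whole-word, case-insensitive occurrences of word in sentence."""
    pattern = r"\b{0}\b".format(re.escape(word))
    return len(re.findall(pattern, sentence, flags=re.IGNORECASE))

sentence = 'In short: interactive tools are in, in every sense.'
print(count_word('in', sentence))           # 3
print(count_word('interactive', sentence))  # 1
```

Note that `\b` treats hyphens and apostrophes as boundaries, so "in" would also be counted inside "built-in". Whether that is acceptable depends on the data.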

Answer 4

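Use split to tokenize the sentence into words, then apply this logic: if the word already exists in the dict, increase its value by 1; otherwise add the word with a count of 1.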

paragraph = 'Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been'
words = paragraph.split()
word_count = {}
for word in words:
    if word in word_count:
        word_count[word] += 1  # seen before: increment the count
    else:
        word_count[word] = 1   # first occurrence

print(word_count)
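
Because `split()` leaves punctuation attached, 'Catholic,' and 'Catholic' end up as different keys above. One possible refinement, as a sketch: strip surrounding punctuation with `str.strip(string.punctuation)` before counting, and use `dict.get` to shorten the if/else. The shorter sample text is made up, and `string.punctuation` only covers ASCII punctuation.

```python
import string

text = 'Nory was a Catholic because her mother was a Catholic, or had been.'
word_count = {}
for token in text.split():
    word = token.strip(string.punctuation)  # drop leading/trailing ASCII punctuation
    if word:  # skip tokens that were only punctuation
        word_count[word] = word_count.get(word, 0) + 1

print(word_count['Catholic'])  # 2
print(word_count['was'])       # 2
```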
