Question

Suppose I have the following string:

"hello&^uevfehello!`.<hellohow*howdhAreyou"

How can I count the frequency of the English words that occur as substrings of it? In this case, I would like a result such as:

{'hello': 3, 'how': 2, 'are': 1, 'you': 1}

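I searched for earlier questions similar to this one, but could not find anything that works. One solution that came close used regular expressions, but it did not work either. That may be because I implemented it incorrectly, since I am not familiar with how it actually works.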

This is the last answer from "How to find the count of a word in a string?":

from collections import Counter
import re

Counter(re.findall(r"[\w']+", text.lower()))
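For reference, here is a small sketch of what that pattern returns on my sample string (the variable names `text` and `tokens` are only for illustration). It splits on punctuation only, so any words that run together stay glued into one token:

```python
import re
from collections import Counter

text = "hello&^uevfehello!`.<hellohow*howdhAreyou"

# [\w']+ grabs maximal runs of word characters and apostrophes,
# so it cannot separate words that have no punctuation between them.
tokens = re.findall(r"[\w']+", text.lower())
print(tokens)
# ['hello', 'uevfehello', 'hellohow', 'howdhareyou']

print(Counter(tokens)["hello"])
# 1
```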

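I also tried writing a really bad function that iterates over every possible run of consecutive letters in the string (up to about 8 letters long). The problems with that are:

1) it takes far longer than it should, and

2) it adds extra words. For example, if "hello" is in the string, "hell" is also found.

I am not very familiar with regular expressions, which are probably the right approach.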

Answer 1

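import re, collections

d, w = "hello&^uevfehello!`.<hellohow*howdhAreyou", ["hello", "how", "are", "you"]
# Build one alternation pattern from the known words; IGNORECASE lets "are" match "Are".
pattern = re.compile("|".join(map(re.escape, w)), flags=re.IGNORECASE)
print(collections.Counter(pattern.findall(d)))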

Output:

Counter({'hello': 3, 'how': 2, 'you': 1, 'Are': 1})
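The output above keeps the original casing ('Are'), while the question asked for 'are'. A minimal variation, assuming the word list is known in advance, lowercases each match before counting. It also sorts the words longest first, because regex alternation tries alternatives in the order written, so a list containing both "hell" and "hello" would otherwise stop at "hell". The names `text`, `words`, and `ordered` are illustrative only.

```python
import re
from collections import Counter

text = "hello&^uevfehello!`.<hellohow*howdhAreyou"
words = ["hello", "how", "are", "you"]

# Longest words first, so that e.g. "hello" is tried before a shorter "hell".
ordered = sorted(words, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(word) for word in ordered), re.IGNORECASE)

# Lowercase each match so "Are" and "are" share one key.
counts = Counter(match.lower() for match in pattern.findall(text))
print(dict(counts))
# {'hello': 3, 'how': 2, 'are': 1, 'you': 1}
```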

Answer 2

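from collections import defaultdict

# is_english_word is not defined in this answer; it is assumed to be a
# dictionary lookup that returns True when its argument is an English word.
s = 'hello&^uevfehello!`.<hellohow*howdhAreyou'
word_counts = defaultdict(int)

i = 0
while i < len(s):
    # Try the longest candidate starting at i first, then shrink it.
    j = len(s)
    while j > i:
        if is_english_word(s[i:j]):
            word_counts[s[i:j]] += 1
            break
        j -= 1

    if j == i:
        i += 1   # no word starts at i; move on one character
    else:
        i = j    # skip past the word just counted

print(word_counts)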
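The loop above depends on an is_english_word helper that the answer does not show. As a rough, self-contained sketch, the version below substitutes a tiny hard-coded set (KNOWN_WORDS, a hypothetical stand-in for a real dictionary) and wraps the same greedy longest-match scan in a function, greedy_word_counts, which is also a name made up here. It lowercases the input first so that "Are" is counted as "are".

```python
from collections import Counter

# Hypothetical stand-in vocabulary; a real check would consult a full dictionary.
KNOWN_WORDS = {"hello", "how", "are", "you"}

def is_english_word(candidate):
    """Return True if candidate is in the stand-in vocabulary."""
    return candidate in KNOWN_WORDS

def greedy_word_counts(s):
    """Scan left to right, always taking the longest word starting at i."""
    s = s.lower()
    counts = Counter()
    i = 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if is_english_word(s[i:j]):
                counts[s[i:j]] += 1
                i = j          # continue right after the matched word
                break
        else:
            i += 1             # nothing starts here; advance one character
    return counts

result = greedy_word_counts("hello&^uevfehello!`.<hellohow*howdhAreyou")
print(dict(result))
# {'hello': 3, 'how': 2, 'are': 1, 'you': 1}
```

With a full dictionary the greedy choice can pick a different split: in "hAreyou", for example, "hare" would be matched before "are" is ever considered.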

Answer 3

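You need to extract all the words from the string, then find the substrings of each word, and then check whether any of those substrings is an English word. I used the English dictionary from "How to check if a word is an English word with Python?".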

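There are some false positives in the result, so you may want to use a better dictionary or a custom method to check for the words you want.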

import re
import enchant
from collections import defaultdict

# Get all substrings in given string.
def get_substrings(string):
    for i in range(len(string)):
        for j in range(i, len(string)):
            yield string[i:j+1]

text = "hello&^uevfehello!`.<hellohow*howdhAreyou"

strings = re.split(r"[^\w']+", text.lower())

# Use english dictionary to check if a word exists.
dictionary = enchant.Dict("en_US")
counts = defaultdict(int)
for s in strings:
    for word in get_substrings(s):
        # Skip single letters; every other substring is checked against the dictionary.
        if len(word) > 1 and dictionary.check(word):
            counts[word] += 1

print(counts)

Output:

defaultdict(<type 'int'>, {'are': 1, 'oho': 1, 'eh': 1, 'ell': 3, 'oh': 1, 'lo': 3, 'll': 3, 'yo': 1, 'how': 2, 'hare': 1, 'ho': 2, 'ow': 2, 'hell': 3, 'you': 1, 'ha': 1, 'hello': 3, 're': 1, 'he': 3})
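One possible way to cut down the false positives, sketched here under the assumption that you can supply your own list of wanted words, is to replace the dictionary check with a membership test against a custom set and to raise the minimum length. The function count_known_words, its min_len parameter, and the vocabulary set are all hypothetical names for this illustration; they are not part of enchant.

```python
import re
from collections import Counter

def count_known_words(text, vocabulary, min_len=3):
    """Count every substring of at least min_len letters found in vocabulary."""
    counts = Counter()
    # Split on anything that is not a lowercase letter after lowercasing.
    for chunk in re.split(r"[^a-z]+", text.lower()):
        for i in range(len(chunk)):
            for j in range(i + min_len, len(chunk) + 1):
                if chunk[i:j] in vocabulary:
                    counts[chunk[i:j]] += 1
    return counts

text = "hello&^uevfehello!`.<hellohow*howdhAreyou"
vocabulary = {"hello", "how", "are", "you"}  # custom list of wanted words
counts = count_known_words(text, vocabulary)
print(dict(counts))
# {'hello': 3, 'how': 2, 'are': 1, 'you': 1}
```

Because every substring is still tested, a vocabulary that contains both "hell" and "hello" would count both; the length threshold only removes short noise such as 'eh' or 'lo'.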

Word frequency in a string without spaces and with special characters?
