Question

I have a sentence like this:

s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye foo"

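I want to find all consecutive repetitions of word sequences, along with the number of times each sequence is repeated. For the example above: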

[('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]

I have a regex-based solution that almost works when every word is a single character, but I can't extend it to real words:

import re

def count_repetitions(sentence):
    return [(list(t[0]), ''.join(t).count(t[0])) for t in re.findall(r'(\w+)(\1+)', ''.join(sentence))]

sentence = ['x', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'g', 'h', 'i', 'i', 'i', 'i', 'a', 'b', 'c', 'd']
print(count_repetitions(sentence))
# [(['a', 'b', 'c'], 3), (['g', 'h'], 2), (['i', 'i'], 2)]

Note that for the last element I would like to get (['i'], 4).

Every word is separated by a space.

Answer 1

This can be done with a regular expression, with the help of capture groups.

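In general, you can capture a repeated pattern with a regex like this: (pattern)\1+. It matches pattern once and then requires the exact text it captured to follow at least one more time.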
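To make the backreference idea concrete, here is a minimal sketch on a toy string. The pattern (ab)\1+ and the sample text are made up for illustration only.

```python
import re

# (ab)\1+ : capture "ab", then require the same captured text at least once more
match = re.search(r'(ab)\1+', 'xxabababyy')

print(match.group(0))  # the whole repeated run: ababab
print(match.group(1))  # the repeated unit: ab
```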

To adapt this to your problem, we only need to take into account that the words are separated by spaces. Here is our new regex: ((\b.+?\b)(?:\s\2)+).

(        # open a group to capture the whole expression, GROUP 1
  (      # open a group to capture the repeated token, GROUP 2
    \b   # boundary metacharacters ensure the token is a whole word
    .+?  # matches anything non-greedily
    \b
  )
  (?:    # open a non-capturing group for the repeated terms
    \s   # match a space
    \2   # match anything matched by GROUP 2
  )+     # match one time or more
)
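The commented layout above can be handed to Python almost as written by compiling with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch; the name REPEAT and the sample sentence are assumptions made for illustration.

```python
import re

# Same pattern as above, kept readable with re.VERBOSE
REPEAT = re.compile(r"""
    (              # GROUP 1: the whole repeated run
      (            # GROUP 2: the repeated token
        \b .+? \b
      )
      (?: \s \2 )+ # one whitespace character plus the token, one or more times
    )
""", re.VERBOSE)

print(REPEAT.findall("bye bye bye now"))  # [('bye bye bye', 'bye')]
```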

Then, with re.findall, we can find all of these patterns and work out how many times each one is repeated.

Code

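import re

def find_repeated_sequences(s):
    # Each match is a tuple: (whole repeated run, repeated token)
    matches = re.findall(r'((\b.+?\b)(?:\s\2)+)', s)
    # count = (len(run) + 1) / (len(token) + 1); only valid for single-spaced text
    return [(token, (len(run) + 1) // (len(token) + 1)) for run, token in matches]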

Note: the formula (len(m[0]) + 1) / (len(m[1]) + 1) assumes the text is single-spaced, and comes from solving this equation for count:

length_total = count × (length_el + 1) - 1
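As a quick sanity check of that equation, here is the arithmetic for one match, written out as a small sketch. The variable names whole, token and count are chosen here for illustration.

```python
whole = "hello hello hello"  # text captured by GROUP 1 (17 characters)
token = "hello"              # text captured by GROUP 2 (5 characters)

# 17 = 3 * (5 + 1) - 1, so count = (17 + 1) / (5 + 1)
count = (len(whole) + 1) // (len(token) + 1)
print(count)  # 3
```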

Example

s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye"
print(find_repeated_sequences(s))

Output

[('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]
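One caveat worth noting: because nothing follows \2 in the regex, a repeat can end in the middle of a longer word, so "a a apple" is reported as ('a', 3). The sketch below is a hypothetical variant, not part of the original answer. It assumes that adding \b after \2 and counting whitespace-separated words (instead of using the length formula) is acceptable for your input. The function name find_repeated_sequences_strict is made up.

```python
import re

def find_repeated_sequences_strict(s):
    # \2\b: a repeat must end on a word boundary, not inside a longer word
    pattern = re.compile(r'((\b.+?\b)(?:\s\2\b)+)')
    results = []
    for whole, token in pattern.findall(s):
        # Count words rather than characters, so spacing does not matter
        results.append((token, len(whole.split()) // len(token.split())))
    return results

print(find_repeated_sequences_strict("a a apple"))  # [('a', 2)]
```

On the example sentence this variant returns the same list as the function above.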

Answer 2

Assuming every word in the string is separated by a space:

# split() with no argument ignores the leading space in s, so 'foo' gets index 0
stringList = s.split()
stringOccurrence = {}
for index, word in enumerate(stringList):
    # Record every position at which each word occurs
    stringOccurrence.setdefault(word, []).append(index)
print(stringOccurrence)

This gives:

{'I': [4],
 'a': [6, 9],
 'am': [5],
 'big': [7, 10],
 'bye': [14, 15, 16, 17],
 'foo': [0, 18],
 'hello': [1, 2, 3, 12, 13],
 'mushroom': [8, 11]}

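Now iterate over the key/value pairs and look for consecutive numbers:

The following code is taken from user:39991 (truppo) on this question; it finds runs of consecutive integers in a list. Since each of our values is a list of integers, we pass each value to this function to identify its consecutive runs.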

def group(L):
    first = last = L[0]
    for n in L[1:]:
        if n - 1 == last: # Part of the group, bump the end
            last = n
        else: # Not part of the group, yield current group and start a new
            yield first, last
            first = last = n
    yield first, last # Yield the last group
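For comparison, the same grouping can be sketched with itertools.groupby from the standard library. This is an alternative to the generator above, not the code from the linked answer; the function name consecutive_runs is made up, and the input is assumed to be a sorted list of distinct integers.

```python
from itertools import groupby

def consecutive_runs(numbers):
    """Yield (first, last) for each run of consecutive integers."""
    # Within a run, value minus position is constant, so it works as a grouping key
    for _, run in groupby(enumerate(numbers), key=lambda pair: pair[1] - pair[0]):
        values = [value for _, value in run]
        yield values[0], values[-1]

print(list(consecutive_runs([1, 2, 3, 12, 13])))  # [(1, 3), (12, 13)]
```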

Create a set to store the results:

resultSet = set()
for key, value in stringOccurrence.items():
    for tup in group(value):
        # Run length = last index - first index + 1
        resultSet.add((key, tup[1] - tup[0] + 1))

print(resultSet)

This should give you:

{('bye', 4), 
 ('am', 1), 
 ('foo', 1), 
 ('I', 1), 
 ('hello', 3), 
 ('hello', 2), 
 ('a', 1), 
 ('mushroom', 1), 
 ('big', 1)}
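Note that this result only counts single words, so 'a big mushroom' shows up as three separate entries with a count of 1. As one possible way to cover multi-word sequences without a regex, here is a sketch that scans the token list directly. It is not part of the original answer, and it assumes the shortest repeating unit should win at each position, as with the non-greedy regex. The function name count_repetitions is borrowed from the question.

```python
def count_repetitions(tokens):
    """Return (sequence, count) for each run of consecutively repeated sequences."""
    results = []
    i = 0
    while i < len(tokens):
        # Try the shortest candidate sequence first
        for n in range(1, (len(tokens) - i) // 2 + 1):
            unit = tokens[i:i + n]
            if tokens[i + n:i + 2 * n] == unit:
                count = 2
                # Keep extending while the next slice repeats the unit again
                while tokens[i + count * n:i + (count + 1) * n] == unit:
                    count += 1
                results.append((unit, count))
                i += count * n
                break
        else:
            i += 1  # no repetition starts at this position
    return results

s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye foo"
print([(' '.join(unit), count) for unit, count in count_repetitions(s.split())])
# [('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]
```

Called on the single-character list from the question, the same function yields (['i'], 4) for the last run.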
