Finding repeated character combinations in a string

Time: 2018-03-11 17:01:03

Tags: python python-3.x

I have a string that holds a very long sentence without any whitespace.

mystring = "abcdthisisatextwithsampletextforasampleabcd"

I want to find all repeated substrings that contain at least 4 characters.

So I would like to achieve something like this:

'text' 2 times
'sample' 2 times
'abcd' 2 times

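Since abcd, text and sample can each be found twice in mystring, they are recognized as matching substrings of at least 4 characters. The important point is that I am looking for repeated substrings; the results do not have to be existing English words.

The answers I found are helpful for finding duplicates in text that contains whitespace, but I could not find a proper resource that covers the case where the string has no spaces at all. I would be very grateful if someone could show me how to do this in the most efficient way.

11 answers:

Answer 0 (score: 11)

Let's go through this step by step. There are several sub-tasks you should take care of: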

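  1. Identify all substrings of length 4 or more.
  2. Count the occurrences of these substrings.
  3. Filter for all substrings that occur 2 or more times.

You can actually put all of this into a few statements. For understanding, it is easier to go through them one at a time.

The following examples all use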

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

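1. Substrings of a given length

You can easily get substrings by slicing - for example, mystring[4:4+6] gives you the substring of length 6 starting at position 4: 'thisis'. More generally, you want substrings of the form mystring[start:start+length].

So what values do you need for start and length?

  • start must...
    • cover all substrings, so it must include the first character: start in range(0, ...)
    • not map to substrings that are too short, so it can stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1)
  • length must...
    • cover the shortest substring of length 4: length in range(min_length, ...)
    • not exceed the remaining string after start: length in range(..., len(mystring) - start + 1)

The +1 terms come from converting lengths (>= 1) to indices (>= 0). You can put all of this together into a single comprehension, where i plays the role of start and j the role of length: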

substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]
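
As a quick sanity check of these bounds, here is a minimal sketch that rebuilds the list and confirms the slicing example. The expected total is derived here rather than taken from the answer: for a string of length 43, the start positions 0 to 39 contribute 40, 39, ..., 1 substrings of length 4 or more.

```python
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]

n = len(mystring)
# Start 0 yields n - min_length + 1 substrings, the last start yields 1,
# so the total is the sum 1 + 2 + ... + (n - min_length + 1).
expected_total = (n - min_length + 1) * (n - min_length + 2) // 2

print(mystring[4:4+6])                      # prints: thisis
print(len(substrings) == expected_total)    # prints: True
print(min(len(s) for s in substrings))      # prints: 4
```

Note that the list grows quadratically with the length of the string, which is why the answer later switches to a generator.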

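2. Counting substrings

Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dicts are made for. So you should use substrings as keys and counts as values in a dict. In essence, this corresponds to this: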

counts = {}
for substring in substrings:
    try:  # increase count for existing keys, set for new keys
         counts[substring] += 1
    except KeyError:
         counts[substring] = 1

You can simply feed your substrings to collections.Counter, and it produces something like the above.

>>> import collections
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})

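Notice how the duplicate 'abcd' maps to a count of 2.

3. Filtering duplicate substrings

So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.

Python offers several constructs for filtering, depending on the output you want. These also work if counts is a regular dict: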

>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}

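Using Python primitives

Python ships with primitives that allow you to do this more efficiently.

  1. Use a generator to build the substrings. A generator builds its members on the fly, so you never actually hold them all in memory. For your use case, you can use a generator expression: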

    substrings = (
        mystring[i:i+j]
        for i in range(0, len(mystring) - min_length + 1)
        for j in range(min_length, len(mystring) - i + 1)
    )
    
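  2. Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer versions, this is much more efficient.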

    counts = collections.Counter(substrings)
    
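  3. You can exploit Python's lazy filters to only ever inspect one substring at a time. The built-in filter, or another generator expression, can produce one result at a time without storing them all in memory.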

    for substring in filter(lambda key: counts[key] > 1, counts):
        print(substring, 'occurs', counts[substring], 'times')
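
Put together, the three steps might look like the sketch below. The function name repeated_substrings is made up for illustration and is not part of the answer. One caveat: the generator avoids building the intermediate list, but the Counter still holds every distinct substring, so memory use remains roughly quadratic in the length of the input.

```python
import collections

def repeated_substrings(text, min_length=4):
    """Map each substring of at least min_length characters that occurs
    more than once (overlapping occurrences included) to its count."""
    # Step 1: lazily generate every substring of sufficient length.
    substrings = (
        text[i:i+j]
        for i in range(0, len(text) - min_length + 1)
        for j in range(min_length, len(text) - i + 1)
    )
    # Step 2: count them.
    counts = collections.Counter(substrings)
    # Step 3: keep only those seen at least twice.
    return {key: value for key, value in counts.items() if value > 1}

mystring = "abcdthisisatextwithsampletextforasampleabcd"
for substring, count in repeated_substrings(mystring).items():
    print(repr(substring), 'occurs', count, 'times')
```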
    

Answer 1 (score: 5)

Script (explained in comments where needed):

from collections import Counter

mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)

possible_matches = []
matches = []

# Range `start_index` from 0 up to the last position that still leaves 4
# characters, due to the minimum char count of 4
for start_index in range(0, mystring_len-3):
    # Start `end_index` at `start_index+1` and range it throughout the rest of
    # the string
    for end_index in range(start_index+1, mystring_len+1):
        current_string = mystring[start_index:end_index]
        if len(current_string) < 4: continue # Skip this iteration, if len < 4
        possible_matches.append(mystring[start_index:end_index])

for possible_match, count in Counter(possible_matches).most_common():
    # Iterate until count is less than or equal to 1 because `Counter`'s
    # `most_common` method lists them in order. Once 1 (or less) is hit, all
    # others are the same or lower.
    if count <= 1: break
    matches.append((possible_match, count))

for match, count in matches:
    print(f'\'{match}\' {count} times')

Output:

'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times

Answer 2 (score: 5)

Nobody is using re! Time for an answer [ab]using the built-in regular expression module ;)

import re

Finding all the maximal repeated substrings

repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))

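This matches the longest substrings that have at least one repetition later in the string (the lookahead checks for it without consuming it). So it finds all disjoint repeated substrings, yielding only the longest ones.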
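
A self-contained sketch of the pattern at work, with each part commented. The second string is an extra example chosen to show the limitation: once 'abcd' has been matched, the scan resumes after it, so the overlapping 'bcde' is never reported.

```python
import re

mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_overlap = "abcdeabcdzzzzbcde"

# (.{4,})    capture at least 4 characters, as many as possible (greedy)
# (?=.*\1)   lookahead: the captured text must occur again further right,
#            but the lookahead itself consumes nothing
pattern = re.compile(r"(.{4,})(?=.*\1)")

# With a single group, findall returns the text captured by that group.
print(set(pattern.findall(mystring)))          # 'abcd', 'text' and 'sample'
print(set(pattern.findall(mystring_overlap)))  # only 'abcd'
```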

Finding all repeated substrings, including overlaps

mystring_overlap = "abcdeabcdzzzzbcde"
# In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0

while True:
    match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
    if match:
        repeated_ones.add(match.group(1))
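        # match.pos would always be 0 here, because we slice the string
        # instead of passing a start position; resume just past this match
        pos += match.start() + 1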
    else:
        break

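This ensures that all substrings that have a repetition are returned, not just the disjoint ones. It should be much slower, but it gets the job done.

If you want all the substrings in addition to the longest repeated ones, then: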

base_repetitions = list(repeated_ones)

for s in base_repetitions:
    for i in range(4, len(s)):
        repeated_ones.add(s[:i])

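This ensures that for long substrings that have a repetition, you also get the smaller substrings - e.g. "sample" and "ample" are found by the re.search code, while "samp", "sampl" and "ampl" are added by the snippet above.

Counting matches

Because (by design) the substrings that we count are non-overlapping, the count method is the way to go: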

from __future__ import print_function
for substr in repeated_ones:
    print("'%s': %d times" % (substr, mystring.count(substr)))
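
One caveat, illustrated with a made-up input: str.count only counts non-overlapping occurrences, so a substring that overlaps with itself is under-counted. If overlapping occurrences matter, a small helper along these lines could be used instead. The name count_overlapping is hypothetical, and the helper assumes a non-empty sub.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing them to overlap."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Resume one character later so that overlapping hits are found too.
        start = text.find(sub, start + 1)
    return count

print("aaaaa".count("aaaa"))                # prints: 1
print(count_overlapping("aaaaa", "aaaa"))   # prints: 2

# For the string from the question, both approaches agree.
mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(mystring.count("sample"), count_overlapping(mystring, "sample"))  # prints: 2 2
```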

Results

Finding the maximal substrings:

With the original mystring from the question:

{'abcd', 'text', 'sample'}

With the mystring_overlap sample:

{'abcd'}

Finding all substrings:

With the original mystring from the question:

{'abcd', 'ample', 'mple', 'sample', 'text'}

... and if we add the code to get all substrings then, of course, we get absolutely all of them:

{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}

With the mystring_overlap sample:

{'abcd', 'bcde'}

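Future work

It is possible to filter the results of finding all substrings with the following steps:

  • take a match "A"
  • check whether this match is a substring of another match, call it "B"
  • if there is such a match "B", check the counter of that match, "B_n"
  • if "A_n = B_n", then remove A
  • go to the first step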

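"A_n < B_n" cannot happen: A is a substring of B, so it must occur at least as often as B.

If "A_n > B_n", there is some extra match of the smaller substring, so it is a distinct substring, because it is repeated in a place where B is not.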
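
The filtering steps above could be sketched as follows. This is one possible reading of the steps, done in a single pass instead of an explicit "go to the first step" loop; the function name drop_redundant and the second example input are assumptions made for illustration.

```python
def drop_redundant(counts):
    """Remove every match A that is a substring of another match B with
    the same count (A_n == B_n); A is kept when A_n > B_n."""
    return {
        a: a_n
        for a, a_n in counts.items()
        if not any(a != b and a in b and a_n == b_n
                   for b, b_n in counts.items())
    }

counts = {'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2,
          'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
print(drop_redundant(counts))
# prints: {'abcd': 2, 'sample': 2, 'text': 2}

# 'abcd' occurs once more than 'abcde', so it is a distinct repeat and stays.
print(drop_redundant({'abcde': 2, 'abcd': 3, 'bcde': 2}))
# prints: {'abcde': 2, 'abcd': 3}
```

The single pass gives the same result as repeated removal: if B is itself removed because of a longer match C with the same count, then A is also a substring of C with that count.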

Answer 3 (score: 4)

$ cat test.py

import collections
import sys 


S = "abcdthisisatextwithsampletextforasampleabcd"


def find(s, min_length=4):
    """ 
    Find repeated character sequences in a provided string.

    Arguments:
    s -- the string to be searched
    min_length -- the minimum length of the sequences to be found
    """
    counter = collections.defaultdict(int)
    # A repeated sequence can't be longer than half the length of s
    sequence_length = len(s) // 2
    # populate counter with all possible sequences
    while sequence_length >= min_length:
        # Iterate over the string until the number of remaining characters is 
        # fewer than the length of the current sequence.
        for i in range(len(s) - sequence_length + 1):
            # Window across the string, getting slices
            # of length == sequence_length. 
            candidate = s[i:i + sequence_length]
            counter[candidate] += 1
        sequence_length -= 1

    # Report.
    for k, v in counter.items():
        if v > 1:
            print('{} {} times'.format(k, v)) 
    return



if __name__ == '__main__':
    try:
        s = sys.argv[1]
    except IndexError:
        s = S 
    find(s)

$ python test.py

sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times

Answer 4 (score: 4)

Here is a Python 3 friendly solution:

from collections import Counter

min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"

all_substrings =[mystring[start_index:][:end_index + 1] for start_index in range(len(mystring)) for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0] for item in counted_substrings.most_common() if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)

Bonus: the largest strings

sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in not_counted_final_candidates if substring1!=substring2 and substring1 in substring2    ]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))

Everything as a function:

from collections import Counter
def get_repeated_strings(input_string, min_str_length = 2, calculate_largest_repeated_string = True ):

    all_substrings = [input_string[start_index:][:end_index + 1]
                      for start_index in range(len(input_string))
                      for end_index in range(len(input_string[start_index:]))]
    counted_substrings = Counter(all_substrings)
    not_counted_final_candidates = [item[0]
                                    for item in counted_substrings.most_common()
                                    if item[1] > 1 and len(item[0]) >= min_str_length]
    counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}

    ### This is just a bit of bonus code for calculating the largest repeating string

    if calculate_largest_repeated_string == True:
        sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
                       not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
        largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))

        return counted_final_candidates, largest_common_strings
    else:
        return counted_final_candidates

Example

mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length= 4))

Output

({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])

Answer 5 (score: 4)

Code:

pattern = "abcdthisisatextwithsampletextforasampleabcd"

string_more_4 = []
k = 4
while(k <= len(pattern)):
    for i in range(len(pattern)):
        if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
            string_more_4.append( pattern[i:k+i])
    k+=1

for i in string_more_4:
    if pattern.count(i) >= 2:
        print(i + " -> " +  str(pattern.count(i)) + " times")

Output:

abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times

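Hope this helps, as my code is short and easy to understand. Cheers!

Answer 6 (score: 3)

This is in Python 2 because I am not using Python 3 at the moment, so you will have to adapt it to Python 3 yourself.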

#!python2

# import module
from collections import Counter

# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []

    # start building the indices
    for i in range(0, length - 2):

        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])

            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break

    # return all of the calculated indices ranges
    return all_sets

# store search strings
tmplst = []; combos = []; found = []

# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"

# get length of string
length = len(mystring)

# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)

# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []

# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])

# make a dictionary containing the strings and their counts
d1 = Counter(found)

# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print k, v

Answer 7 (score: 2)

Here is my approach to this problem:

def get_repeated_words(string, minimum_len):

    # Storing count of repeated words in this dictionary
    repeated_words = {}

    # Traversing till last but 4th element
    # Actually leaving `minimum_len` elements at end (in this case its 4)
    for i in range(len(string)-minimum_len):

        # Starting with a length of 4(`minimum_len`) and going till end of string
        for j in range(i+minimum_len, len(string)):

            # getting the current word
            word = string[i:j]

            # counting the occurrences of the word
            word_count = string.count(word)

            if word_count > 1:

                # storing in dictionary along with its count if found more than once
                repeated_words[word] = word_count

    return repeated_words

if __name__ == '__main__':              
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    result = get_repeated_words(mystring, 4)

Answer 8 (score: 2)

Here is a simple solution using the more_itertools library.

Given

import collections as ct

import more_itertools as mit


s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)

Code

windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered

Output

{'abcd': 2,
 'text': 2,
 'samp': 2,
 'ampl': 2,
 'mple': 2,
 'sampl': 2,
 'ample': 2,
 'sample': 2}

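Details

The procedure is:

  1. build sliding windows of varying sizes lbound <= n < ubound
  2. count all occurrences and filter for the repeated ones

more_itertools is a third-party package, installed via > pip install more_itertools.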
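
If installing a third-party package is not an option, the same sliding-window idea can be sketched with the standard library alone. The helper name windows is an assumption for illustration; it slices the string directly instead of calling mit.windowed, so no "".join step is needed.

```python
import collections

def windows(text, n):
    """Yield every length-n slice of text, moving one character at a time."""
    return (text[i:i + n] for i in range(len(text) - n + 1))

s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)

# Count the windows of every size lbound <= n < ubound in one Counter.
counts = collections.Counter(
    w for n in range(lbound, ubound) for w in windows(s, n)
)
filtered = {k: v for k, v in counts.items() if v > 1}
print(filtered)
```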

Answer 9 (score: 2)

This is how I would do it, but I don't know of any other way:

string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurences = {}
for i in range(4, l):
  # + 1 so that the window ending at the last character is counted too
  for start in range(l - i + 1):
    substring = string[start:start + i]
    occurences[substring] = occurences.get(substring, 0) + 1
for key in occurences.keys():
  if occurences[key] > 1:
    print("'" + key + "'", str(occurences[key]), "times")

Output:

'abcd' 2 times
'text' 2 times
'samp' 2 times
'ampl' 2 times
'mple' 2 times
'sampl' 2 times
'ample' 2 times
'sample' 2 times

Efficient, no; but easy to understand, yes.
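
On the efficiency point, one cheap improvement is to stop early. This is a sketch of a different technique, not the author's method: if no substring of a given length repeats, no longer substring can repeat either, because two occurrences of a longer substring would also be two occurrences of its prefix. The function name repeated_by_length is hypothetical.

```python
from collections import Counter

def repeated_by_length(s, min_length=4):
    """Collect repeated substrings length by length, stopping at the
    first length for which nothing repeats."""
    found = {}
    for length in range(min_length, len(s)):
        counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
        repeats = {sub: n for sub, n in counts.items() if n > 1}
        if not repeats:
            break  # nothing repeats at this length, so nothing longer can
        found.update(repeats)
    return found

mystring = "abcdthisisatextwithsampletextforasampleabcd"
for sub, n in repeated_by_length(mystring).items():
    print("'" + sub + "'", n, "times")
```

For the string from the question, the loop stops at length 7 instead of running through all lengths up to 42.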

Answer 10 (score: 0)

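s = 'abcabcabcdabcd'

def get_repeats(s, l):
    # Count every substring of length l; a fresh dict per call, so results
    # from earlier calls do not leak into later ones
    d = {}
    # + 1 so that the substring ending at the last character is included
    for i in range(len(s) - l + 1):
        ss = s[i: i+l]
        if ss not in d:
            d[ss] = 1
        else:
            d[ss] = d[ss]+1
    return d

print(get_repeats(s, 3))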