I have a string that contains one very long sentence without any whitespace.
mystring = "abcdthisisatextwithsampletextforasampleabcd"
I would like to find all repeated substrings that contain at least 4 characters.
So I would like to achieve something like this:
'text' 2 times
'sample' 2 times
'abcd' 2 times
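Since abcd, text and sample can each be found twice in mystring, they are recognized as properly matched substrings of at least 4 characters. It is important that I am looking for repeated substrings; finding only existing English words is not a requirement.
The answers I found are helpful for finding duplicates in text that contains whitespace, but I could not find a proper resource that covers the case where the string has no spaces or other whitespace. I would be very grateful if somebody could show me how to do this in the most efficient way.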
Answer 0 (score: 11)
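Let's go through this step by step. There are several sub-tasks you should take care of:
1. Identify all substrings of length 4 or more.
2. Count the occurrences of each unique substring.
3. Keep only the substrings that occur 2 or more times.
You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.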
The following examples all use
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4
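You can easily get substrings by slicing. For example, mystring[4:4+6] gives you the substring of length 6 starting at position 4: 'thisis'. More generally, you want substrings of the form mystring[start:start+length].
So what values do you need for start and length?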
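start must...
- cover the string from its first character: start in range(0, ...).
- stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1).
length must...
- begin at the shortest allowed substring: length in range(min_length, ...).
- not exceed the remaining string after start (written i in the code): length in range(..., len(mystring) - i + 1).
The +1 terms come from converting lengths (>= 1) to indices (>= 0).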
You can put everything together into a single comprehension:
substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]
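To make the two ranges concrete, here is a small sketch on a shorter, made-up string. The names `demo`, `demo_min_length` and `pairs` are illustrative only and not part of the answer; the ranges are the same ones used in the comprehension above.

```python
# Illustrative values only; the answer itself uses `mystring` and `min_length`.
demo = "abcdef"
demo_min_length = 4

# Every (start, length) pair visited by the two nested ranges.
pairs = [
    (start, length)
    for start in range(0, len(demo) - demo_min_length + 1)
    for length in range(demo_min_length, len(demo) - start + 1)
]
demo_substrings = [demo[start:start + length] for start, length in pairs]

print(pairs)
print(demo_substrings)
```

The last start value is 2, because a substring of 4 characters starting any later would run past the end of the 6-character string.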
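Naturally, you want to keep a count for each substring. Keeping something for each specific object is what a dict is for. So you should use the substrings as keys and the counts as values in a dict. In essence, this corresponds to: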
counts = {}
for substring in substrings:
    try:  # increase count for existing keys, set for new keys
        counts[substring] += 1
    except KeyError:
        counts[substring] = 1
You can simply feed your substrings to collections.Counter, and it produces something like the above.
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})
Notice how the duplicate 'abcd' maps to a count of 2.
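As a quick check that the manual loop and collections.Counter agree, the sketch below builds both and compares them. The helper `all_substrings` is a hypothetical name introduced here for illustration, and `dict.get` is used in place of the try/except shown earlier.

```python
import collections

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

def all_substrings(text, min_length):
    """Yield every substring of `text` with at least `min_length` characters."""
    for i in range(len(text) - min_length + 1):
        for j in range(min_length, len(text) - i + 1):
            yield text[i:i + j]

# Manual counting with a plain dict, using dict.get instead of try/except.
manual_counts = {}
for substring in all_substrings(mystring, min_length):
    manual_counts[substring] = manual_counts.get(substring, 0) + 1

# The same counting delegated to collections.Counter.
counter_counts = collections.Counter(all_substrings(mystring, min_length))

print(dict(counter_counts) == manual_counts)
print(counter_counts["abcd"], counter_counts["this"])
```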
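So now you have your substrings and the count for each of them. You need to remove the non-duplicate substrings, that is, those with a count of 1.
Python offers several constructs for filtering, depending on the output you want. These also work if counts is a regular dict: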
>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
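Python ships with primitives that allow you to do this more efficiently.
Use a generator to build the substrings. A generator builds its members on the fly, so you never actually have them all in memory. For your use case, you can use a generator expression: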
substrings = (
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
)
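One practical consequence of building members on the fly is that a generator can only be consumed once. The following sketch, using the same expression, illustrates this; `first_pass` and `second_pass` are names made up for the demonstration.

```python
import types

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

substrings = (
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
)

# A generator object, not a list: nothing has been sliced yet.
print(isinstance(substrings, types.GeneratorType))

first_pass = sum(1 for _ in substrings)   # consumes the generator
second_pass = sum(1 for _ in substrings)  # nothing is left to yield
print(first_pass, second_pass)
```

If the substrings are needed more than once, either rebuild the generator or fall back to the list version.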
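Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer versions, this is much more efficient.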
counts = collections.Counter(substrings)
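You can exploit Python's lazy filters to inspect only one substring at a time. The built-in filter or another generator expression can produce one result at a time without storing them all in memory.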
for substring in filter(lambda key: counts[key] > 1, counts):
    print(substring, 'occurs', counts[substring], 'times')
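Putting the three steps together, a minimal sketch could look like the following. The function name `repeated_substrings` is an assumption made for this example; the body only combines the generator expression, collections.Counter and the dict comprehension shown above.

```python
import collections

def repeated_substrings(text, min_length=4):
    """Return {substring: count} for substrings occurring more than once.

    Occurrences are counted by start position, so overlapping repeats count.
    """
    substrings = (
        text[i:i + j]
        for i in range(len(text) - min_length + 1)
        for j in range(min_length, len(text) - i + 1)
    )
    counts = collections.Counter(substrings)
    return {key: value for key, value in counts.items() if value > 1}

result = repeated_substrings("abcdthisisatextwithsampletextforasampleabcd")
for substring, count in result.items():
    print(repr(substring), count, "times")
```

Note that this still visits every substring, so the work grows roughly quadratically with the length of the input.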
Answer 1 (score: 5)
Script (explained in comments where needed):
from collections import Counter
mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)
possible_matches = []
matches = []
# Range `start_index` from 0 up to 3 characters before the end, due to the
# minimum char count of 4
for start_index in range(0, mystring_len-3):
    # Start `end_index` at `start_index+1` and range it throughout the rest of
    # the string
    for end_index in range(start_index+1, mystring_len+1):
        current_string = mystring[start_index:end_index]
        if len(current_string) < 4: continue # Skip this iteration, if len < 4
        possible_matches.append(current_string)
for possible_match, count in Counter(possible_matches).most_common():
    # Iterate until count is less than or equal to 1 because `Counter`'s
    # `most_common` method lists them in order. Once 1 (or less) is hit, all
    # others are the same or lower.
    if count <= 1: break
    matches.append((possible_match, count))
for match, count in matches:
    print(f'\'{match}\' {count} times')
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times
Answer 2 (score: 5)
Nobody is using re! Time for an answer [ab]using the regular expression built-in module ;)
import re
repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))
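This matches the longest substrings that have at least one repetition later in the string (the lookahead checks for it without consuming it). So it finds all disjoint substrings that are repeated, while yielding only the longest ones.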
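To see which positions the pattern actually matches at, here is a small sketch that runs the same regular expression through finditer. The inline comments describe the two parts of the pattern; the variable `pattern` is a name introduced for this example.

```python
import re

mystring = "abcdthisisatextwithsampletextforasampleabcd"

# (.{4,})   capture at least 4 characters, greedily (longest first)
# (?=.*\1)  look ahead: the captured text must appear again later on,
#           without consuming anything
pattern = re.compile(r"(.{4,})(?=.*\1)")

for match in pattern.finditer(mystring):
    print(match.start(), match.group(1))

repeated_ones = set(pattern.findall(mystring))
print(repeated_ones)
```

Because each match consumes its captured text, a repeat that starts inside an earlier match (such as 'ample' inside 'sample') is not reported by this pattern.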
mystring_overlap = "abcdeabcdzzzzbcde"
# In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0
while True:
    match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
    if match:
        repeated_ones.add(match.group(1))
        # match.pos is the offset at which search() started (always 0 here),
        # so this moves the slice forward one character per iteration
        pos += match.pos + 1
    else:
        break
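This ensures that all substrings with repetition (not only the disjoint ones) are returned. It should be much slower, but it gets the job done.
If, in addition to the longest repeated strings, you also want all of their substrings, then: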
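As an alternative to the explicit loop, overlapping matches can also be collected in a single call by wrapping the whole pattern in a lookahead. This is a different technique from the one in the answer, offered only as a sketch; `repeats_per_start` is a made-up name.

```python
import re

def repeats_per_start(text, min_length=4):
    """For each start position, collect the longest substring that occurs
    again further right without overlapping that first occurrence."""
    # The outer (?=...) makes the whole match zero-width, so findall moves on
    # one character at a time and matches are allowed to overlap.
    pattern = r"(?=(.{%d,})(?=.*\1))" % min_length
    return set(re.findall(pattern, text))

print(sorted(repeats_per_start("abcdeabcdzzzzbcde")))
print(sorted(repeats_per_start("abcdthisisatextwithsampletextforasampleabcd")))
```

Like the loop, this relies on heavy backtracking, so it is unlikely to be fast on long inputs.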
base_repetitions = list(repeated_ones)
for s in base_repetitions:
    for i in range(4, len(s)):
        repeated_ones.add(s[:i])
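This ensures that for long substrings that repeat, you also have the smaller substrings: for example, "sample" and "ample" are found by the re.search code, and "samp", "sampl" and "ampl" are added by the snippet above.
Since (by design) the substrings that we count are non-overlapping, the count method is the way to go: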
from __future__ import print_function
for substr in repeated_ones:
    print("'%s': %d times" % (substr, mystring.count(substr)))
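The reason non-overlap matters is that str.count only counts non-overlapping occurrences. A short sketch of the difference, using a made-up string `text`:

```python
import re

text = "aaaa"

# str.count scans left to right and skips past each match it finds,
# so only non-overlapping occurrences are counted.
non_overlapping = text.count("aa")

# A zero-width lookahead matches at every start position instead,
# so overlapping occurrences are counted as well.
overlapping = len(re.findall(r"(?=aa)", text))

print(non_overlapping, overlapping)
```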
With the question's original mystring, re.findall gives:
{'abcd', 'text', 'sample'}
With the mystring_overlap sample:
{'abcd'}
With the question's original mystring, the re.search loop gives:
{'abcd', 'ample', 'mple', 'sample', 'text'}
... and if we add the code to get all substrings then, of course, we get all of them:
{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}
With the mystring_overlap sample:
{'abcd', 'bcde'}
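The results of finding all substrings can be filtered by comparing the counts of a shorter match A and a longer match B that contains it, written A_n and B_n here. If "A_n > B_n", the smaller substring has some extra matches, so it is a distinct substring, because it is repeated in a place where B is not repeated.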
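One possible reading of this filter, sketched under the assumption that A_n and B_n are the occurrence counts of a shorter match A and of a longer match B containing it: A is dropped when some B has the same count, and kept when A_n > B_n. The function name `drop_redundant` is hypothetical.

```python
def drop_redundant(counts):
    """Keep a match A unless some longer match B contains it with the same count.

    `counts` maps each repeated substring to its number of occurrences.
    This reading of A_n and B_n is an assumption based on the text above.
    """
    kept = {}
    for a, a_n in counts.items():
        redundant = any(
            a != b and a in b and a_n == b_n
            for b, b_n in counts.items()
        )
        if not redundant:
            kept[a] = a_n
    return kept

counts = {'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2,
          'ampl': 2, 'ample': 2, 'mple': 2}
print(drop_redundant(counts))
```

With a count table such as {'abcd': 3, 'abcde': 2}, both entries would survive, because 'abcd' occurs once more than 'abcde'.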
Answer 3 (score: 4)
$ cat test.py
import collections
import sys
S = "abcdthisisatextwithsampletextforasampleabcd"
def find(s, min_length=4):
    """
    Find repeated character sequences in a provided string.
    Arguments:
    s -- the string to be searched
    min_length -- the minimum length of the sequences to be found
    """
    counter = collections.defaultdict(int)
    # A repeated sequence can't be longer than half the length of s
    # (assuming its occurrences do not overlap)
    sequence_length = len(s) // 2
    # populate counter with all possible sequences
    while sequence_length >= min_length:
        # Iterate over the string until the number of remaining characters is
        # fewer than the length of the current sequence.
        for i, x in enumerate(s[:-(sequence_length - 1)]):
            # Window across the string, getting slices
            # of length == sequence_length.
            candidate = s[i:i + sequence_length]
            counter[candidate] += 1
        sequence_length -= 1
    # Report.
    for k, v in counter.items():
        if v > 1:
            print('{} {} times'.format(k, v))
    return
if __name__ == '__main__':
    try:
        s = sys.argv[1]
    except IndexError:
        s = S
    find(s)
$ python test.py
sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times
Answer 4 (score: 4)
Here is a Python 3 friendly solution:
from collections import Counter
min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"
all_substrings =[mystring[start_index:][:end_index + 1] for start_index in range(len(mystring)) for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0] for item in counted_substrings.most_common() if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)
Bonus: largest strings
sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in not_counted_final_candidates if substring1!=substring2 and substring1 in substring2 ]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))
Everything as a function:
from collections import Counter
def get_repeated_strings(input_string, min_str_length = 2, calculate_largest_repeated_string = True ):
    all_substrings = [input_string[start_index:][:end_index + 1]
                      for start_index in range(len(input_string))
                      for end_index in range(len(input_string[start_index:]))]
    counted_substrings = Counter(all_substrings)
    not_counted_final_candidates = [item[0]
                                    for item in counted_substrings.most_common()
                                    if item[1] > 1 and len(item[0]) >= min_str_length]
    counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
    ### This is just a bit of bonus code for calculating the largest repeating string
    if calculate_largest_repeated_string == True:
        sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
                           not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
        largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))
        return counted_final_candidates, largest_common_strings
    else:
        return counted_final_candidates
Example:
mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length= 4))
Output:
({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])
Answer 5 (score: 4)
Code:
pattern = "abcdthisisatextwithsampletextforasampleabcd"
string_more_4 = []
k = 4
while(k <= len(pattern)):
    for i in range(len(pattern)):
        if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
            string_more_4.append( pattern[i:k+i])
    k+=1
for i in string_more_4:
    if pattern.count(i) >= 2:
        print(i + " -> " + str(pattern.count(i)) + " times")
Output:
abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times
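Hope this helps, as my code is short and easy to understand. Cheers!
Answer 6 (score: 3)
This is in Python 2, because I am not using Python 3 at the moment. So you will have to adapt it to Python 3 yourself.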
#!python2
# import module
from collections import Counter
# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []
    # start building the indices
    for i in range(0, length - 2):
        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])
            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break
    # return all of the calculated indices ranges
    return all_sets
# store search strings
tmplst = []; combos = []; found = []
# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"
# get length of string
length = len(mystring)
# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)
# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []
# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])
# make a dictionary containing the strings and their counts
d1 = Counter(found)
# keep the strings with a count of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print k, v
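As far as I can tell, the only Python 2 specific construct in the script is the final print statement. A sketch of that last loop for Python 3 follows; the `found` list here is illustrative stand-in data, not the list the script builds, and `repeated` is a name added for the example.

```python
from collections import Counter

# Stand-in for the `found` list built by the script above (illustrative data).
found = ["abcd", "text", "abcd", "this", "text"]
d1 = Counter(found)

# Python 3: print is a function, so the arguments go in parentheses.
repeated = []
for k, v in d1.items():
    if v > 1:
        repeated.append((k, v))
        print(k, v)
```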
Answer 7 (score: 2)
This is my approach to this problem:
def get_repeated_words(string, minimum_len):
    # Storing count of repeated words in this dictionary
    repeated_words = {}
    # Traversing till last but 4th element
    # Actually leaving `minimum_len` elements at end (in this case it's 4)
    for i in range(len(string)-minimum_len):
        # Starting with a length of 4(`minimum_len`) and going till end of string
        for j in range(i+minimum_len, len(string)):
            # getting the current word
            word = string[i:j]
            # counting the occurrences of the word
            word_count = string.count(word)
            if word_count > 1:
                # storing in dictionary along with its count if found more than once
                repeated_words[word] = word_count
    return repeated_words
if __name__ == '__main__':
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    result = get_repeated_words(mystring, 4)
    print(result)
Answer 8 (score: 2)
Here is a simple solution using the more_itertools library.
Given
import collections as ct
import more_itertools as mit
s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)
Code
windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered
Output
{'abcd': 2,
'text': 2,
'samp': 2,
'ampl': 2,
'mple': 2,
'sampl': 2,
'ample': 2,
'sample': 2}
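Details
The procedure is to build sliding windows of varying sizes lbound <= n < ubound, then count all occurrences and keep those that appear more than once.
more_itertools is a third-party package installed by > pip install more_itertools.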
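If installing a third-party package is not an option, the sliding-window step can be approximated with the standard library alone. The `windowed` function below is a simplified stand-in written for this sketch: it only accepts sequences and does not offer the fill value or step options of the library function.

```python
import collections

def windowed(seq, n):
    """Yield every run of n consecutive items of seq as a tuple."""
    for start in range(len(seq) - n + 1):
        yield tuple(seq[start:start + n])

s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)

# One flat stream of windows for every size lbound <= n < ubound.
windows = (w for n in range(lbound, ubound) for w in windowed(s, n))
filtered = {"".join(k): v for k, v in collections.Counter(windows).items() if v > 1}
print(filtered)
```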
Answer 9 (score: 2)
This is how I would do it, but I don't know of any other way:
string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurences = {}
for i in range(4, l):
    # l - i + 1 so that the last window of length i (ending at the end of
    # the string) is included
    for start in range(l - i + 1):
        substring = string[start:start + i]
        occurences[substring] = occurences.get(substring, 0) + 1
for key in occurences.keys():
    if occurences[key] > 1:
        print("'" + key + "'", str(occurences[key]), "times")
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'ampl' 2 times
'mple' 2 times
'sampl' 2 times
'ample' 2 times
'sample' 2 times
Efficient, no; but easy to understand, yes.
Answer 10 (score: 0)
s = 'abcabcabcdabcd'
d = {}
def get_repeats(s, l):
    # len(s)-l+1 so that the last window of length l is included
    for i in range(len(s)-l+1):
        ss = s[i: i+l]
        if ss not in d:
            d[ss] = 1
        else:
            d[ss] = d[ss]+1
    return d
get_repeats(s, 3)