I'm writing a Python script in which I have to find common substrings across many string sequences. For example:
sequence1 = 'mweitngaomjksjasper;36nnG1bmaso3th7a\-'
sequence2 = 'asngiqbwebs7-236jasper;u52dsv--4512G1b'
sequence3 = 'asvjaspermininwqmamnf-121xvxnesgq232'
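jasper occurs 3 times: once each in sequence1, sequence2, and sequence3. G1b occurs 2 times: once in sequence1 and once in sequence2.
For every substring that occurs two or more times, I need to add it to a dictionary as substring => count. In this case, my dictionary would be: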
dict = { 'jasper': '3', 'G1b': '2'}
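I will be populating this dictionary from thousands of sequences, and any substring that occurs two or more times across those sequences needs to be added to it. What is the best way to do this without crashing the system?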
Answer 0: (score: 1)
Here is one approach:
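from collections import Counter

def all_prefixes(x, minlen):
    # Yield every prefix of x with at least minlen characters, including x itself.
    # (The upper bound is len(x) + 1; stopping at len(x) skips substrings that
    # end at the end of a sequence, such as the trailing 'G1b'.)
    for i in range(minlen, len(x) + 1):
        yield x[:i]

def all_substrings(x, minlen=1):
    # Every substring is a prefix of some suffix, so recurse over the suffixes.
    # Recursion depth grows with len(x); very long strings can hit the recursion limit.
    if len(x) < minlen:
        return
    yield from all_prefixes(x, minlen)
    yield from all_substrings(x[1:], minlen)

# '\\-' keeps the literal backslash without an invalid-escape warning.
words = [
    'mweitngaomjksjasper;36nnG1bmaso3th7a\\-',
    'asngiqbwebs7-236jasper;u52dsv--4512G1b',
    'asvjaspermininwqmamnf-121xvxnesgq232'
]

counts = Counter(x for w in words for x in all_substrings(w, minlen=3))
print({k: v for k, v in counts.items() if v >= 2})
This prints every substring of length at least 3 that occurs at least twice, together with its count (key order may vary):
{'jasper': 3, 'jasper;': 2, 'asper;': 2, 'sper': 3, 'er;': 2, 'jasp': 3, 'per;': 2, 'spe': 3, 'jas': 3, 'asp': 3, 'asper': 3, 'aspe': 3, 'per': 3, 'sper;': 2, 'jaspe': 3, 'G1b': 2}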
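Since the question mentions thousands of sequences, memory is likely to be the main concern: a string of length n has roughly n²/2 substrings. The following is a minimal sketch of one possible way to keep that in check, assuming it is acceptable to cap the substring length. The names `bounded_substrings`, `repeated_substrings`, `maxlen`, and `threshold` are hypothetical and not part of the original answer; the generator is iterative, so it does not depend on the recursion limit.

```python
from collections import Counter


def bounded_substrings(s, minlen=3, maxlen=8):
    """Yield every substring of s with minlen <= length <= maxlen (iteratively)."""
    n = len(s)
    for start in range(n - minlen + 1):
        # Never read past the end of the string.
        longest = min(maxlen, n - start)
        for length in range(minlen, longest + 1):
            yield s[start:start + length]


def repeated_substrings(sequences, minlen=3, maxlen=8, threshold=2):
    """Map each substring seen at least `threshold` times to its total count."""
    counts = Counter()
    for seq in sequences:
        counts.update(bounded_substrings(seq, minlen, maxlen))
    return {sub: n for sub, n in counts.items() if n >= threshold}


sequences = [
    'mweitngaomjksjasper;36nnG1bmaso3th7a\\-',
    'asngiqbwebs7-236jasper;u52dsv--4512G1b',
    'asvjaspermininwqmamnf-121xvxnesgq232',
]
result = repeated_substrings(sequences)
print(result['jasper'], result['G1b'])  # prints: 3 2
```

With the cap in place, each sequence contributes at most len(s) * (maxlen - minlen + 1) substrings, so the counter grows roughly linearly with the input size rather than quadratically. The trade-off is that repeated substrings longer than `maxlen` are not reported.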
Answer 1: (score: 0)
First, we'll write a quick little generator that takes a string and yields every substring of that string:
from collections import Counter
import itertools

def substrings(s):
    # Yield every non-empty substring s[i:j] of s.
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            yield s[i:j]

# '\\-' keeps the literal backslash without an invalid-escape warning.
sequences = ['mweitngaomjksjasper;36nnG1bmaso3th7a\\-',
             'asngiqbwebs7-236jasper;u52dsv--4512G1b',
             'asvjaspermininwqmamnf-121xvxnesgq232']

# Count every substring across all sequences.
c = Counter(itertools.chain.from_iterable(map(substrings, sequences)))
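Then we can use itertools.takewhile to keep only the substrings that occur more than once. This works because most_common() returns the entries sorted by descending count, so iteration can stop at the first entry whose count is 1: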
print(list(itertools.takewhile(lambda x: x[1] > 1, c.most_common())))
This prints:
[('s', 10), ('a', 9), ('n', 8), ('2', 6), ('e', 6), ('as', 6), ('m', 6), ('1', 5), ('-', 5), ('3', 4), ('i', 4), ('j', 4), ('b', 4), ('q', 3), ('er', 3), ('r', 3), ('asper', 3), ('g', 3), ('per', 3), ('v', 3), ('jaspe', 3), ('ja', 3), ('sp', 3), ('spe', 3), ('aspe', 3), ('sper', 3), ('jas', 3), ('asp', 3), ('w', 3), ('jasper', 3), ('p', 3), ('pe', 3), ('jasp', 3), ('o', 2), ('ma', 2), ('r;', 2), ('23', 2), ('12', 2), ('jasper;', 2), ('1b', 2), ('G1b', 2), ('asper;', 2), ('t', 2), ('sv', 2), ('5', 2), ('36', 2), ('per;', 2), ('x', 2), ('in', 2), ('6', 2), ('G1', 2), ('G', 2), ('7', 2), ('er;', 2), ('we', 2), (';', 2), ('ng', 2), ('sper;', 2)]
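To make the takewhile step above easier to check, here is a minimal sketch on a small input. It is only an illustration: the word 'abracadabra' and the names `via_takewhile` and `via_filter` are assumptions introduced here, not part of the original answer. It compares the takewhile approach with a plain dictionary comprehension that does not depend on the ordering.

```python
from collections import Counter
from itertools import takewhile

counts = Counter('abracadabra')  # a: 5, b: 2, r: 2, c: 1, d: 1

# most_common() is sorted by descending count, so takewhile can stop
# at the first entry whose count drops to 1.
via_takewhile = dict(takewhile(lambda item: item[1] > 1, counts.most_common()))

# A plain filter gives the same entries without relying on the ordering.
via_filter = {sub: n for sub, n in counts.items() if n > 1}

print(via_takewhile == via_filter)  # prints: True
```

Both forms select the same entries. The takewhile version pays for a full sort inside most_common(), while the comprehension makes a single pass over the counter, which may matter once the counter holds substrings from thousands of sequences.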