如何在python中找到常见的子字符串?

时间:2017-01-25 13:44:36

标签: python python-3.x

我正在编写一个python脚本,我必须在许多字符串序列中找到常见的子字符串。 例如:

sequence1 = 'mweitngaomjksjasper;36nnG1bmaso3th7a\-'
sequence2 = 'asngiqbwebs7-236jasper;u52dsv--4512G1b'
sequence3 = 'asvjaspermininwqmamnf-121xvxnesgq232'

jasper发生3次 - 在sequence1,sequence2和sequence3中各出现一次。 G1b发生2次 - 序列1中一次,序列2中一次。

对于每次出现两次或更多次数的子字符串,我需要将它们添加到字典中,作为substring =>计数。 在这种情况下,我的字典将是:

dict = { 'jasper': '3', 'G1b': '2'}

我将使用数千个序列填充此字典,如果一个子字符串在任何序列中出现两次或更多次, 需要添加到这本词典中。在不破坏系统的情况下,最好的方法是什么?

2 个答案:

答案 0 :(得分:1)

这是一种方法:

def all_prefixes(x, minlen):
  for i in range(minlen, len(x)):
    yield x[:i]


def all_substrings(x, minlen=1):
  if len(x) < minlen:
    return
  yield from all_prefixes(x, minlen)
  yield from all_substrings(x[1:], minlen)


from collections import Counter
words = [
  'mweitngaomjksjasper;36nnG1bmaso3th7a\-',
  'asngiqbwebs7-236jasper;u52dsv--4512G1b',
  'asvjaspermininwqmamnf-121xvxnesgq232'
]
print(dict((k,v) for k,v in Counter(x for w in words for x in all_substrings(w, minlen=3)).items() if v >= 2))

打印所有子串的计数至少两次,最小长度为3:

{'jasper': 3, 'jasper;': 2, 'asper;': 2, 'sper': 3, 'er;': 2, 'jasp': 3, 'per;': 2, 'spe': 3, 'jas': 3, 'asp': 3, 'asper': 3, 'aspe': 3, 'per': 3, 'sper;': 2, 'jaspe': 3}

答案 1 :(得分:0)

首先,我们将编写一个快速的小生成器,它接受一个字符串并生成该字符串的每个子字符串

from collections import Counter
import itertools

def substrings(s):
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            yield s[i:j]

sequences = ['mweitngaomjksjasper;36nnG1bmaso3th7a\-',
             'asngiqbwebs7-236jasper;u52dsv--4512G1b',
             'asvjaspermininwqmamnf-121xvxnesgq232']

c = Counter(itertools.chain.from_iterable(s for s in map(substrings, sequences)))

然后我们可以使用itertools.takewhile仅获取多次出现的子串

print(list(itertools.takewhile(lambda x: x[1] > 1, c.most_common())))

打印

[('s', 10), ('a', 9), ('n', 8), ('2', 6), ('e', 6), ('as', 6), ('m', 6), ('1', 5), ('-', 5), ('3', 4), ('i', 4), ('j', 4), ('b', 4), ('q', 3), ('er', 3), ('r', 3), ('asper', 3), ('g', 3), ('per', 3), ('v', 3), ('jaspe', 3), ('ja', 3), ('sp', 3), ('spe', 3), ('aspe', 3), ('sper', 3), ('jas', 3), ('asp', 3), ('w', 3), ('jasper', 3), ('p', 3), ('pe', 3), ('jasp', 3), ('o', 2), ('ma', 2), ('r;', 2), ('23', 2), ('12', 2), ('jasper;', 2), ('1b', 2), ('G1b', 2), ('asper;', 2), ('t', 2), ('sv', 2), ('5', 2), ('36', 2), ('per;', 2), ('x', 2), ('in', 2), ('6', 2), ('G1', 2), ('G', 2), ('7', 2), ('er;', 2), ('we', 2), (';', 2), ('ng', 2), ('sper;', 2)]