Efficiently generating a list of all possible substrings of a string

Time: 2019-02-24 17:50:51

Tags: python c++ c string list

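I want to write a function that efficiently returns a list of all possible substrings of a string, given a minimum and a maximum substring length. (The strings contain uppercase letters only.)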

For example, for the string 'THISISASTRING' with min_length=3 and max_length=4, it should return:

['THI', 'THIS', 'HIS', 'HISI', 'ISI', 'ISIS', 'SIS', 'SISA', 'ISA',
 'ISAS', 'SAS', 'SAST', 'AST', 'ASTR', 'STR', 'STRI', 'TRI', 'TRIN',
 'RIN', 'RING', 'ING']

I am looking for a solution that is much faster than my current one:

import cProfile

random_english_text = \
    'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
    'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
    'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
    'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
    'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'

def assemble_substrings(textstring, length_min, length_max):
    str_len = len(textstring)
    subStringList = []
    idx = 0
    while idx <= str_len - length_min:
        max_depth = min(length_max, str_len - idx)
        for i in list(range(length_min, max_depth + 1)):
            subString = textstring[idx:idx + i]
            subStringList.append(subString)
        idx += 1
    return subStringList


pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = assemble_substrings(textstring=random_english_text, length_min=4, length_max=10)

pr.disable()
pr.print_stats(sort='cumtime')

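This produces profiler output, but I cannot learn much from it about how to speed the function up.

What is the best way to make this function as fast as possible? Should I use a data structure other than a list? Use Cython? Or write this code in an external C/C++ shared object?

Any input would be highly appreciated, including general advice on how to efficiently handle strings and operations like the one above in Python.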

3 answers:

Answer 0: (score: 3)

Why not simply use a list comprehension with string slicing over two ranges?

t = "SOMETEXT"

print(t)

minl = 3
maxl = 8

parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]

print(parts)

Output:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
 'OMETEXT', 'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'METEXT', 'METEXT', 'ETE', 'ETEX', 
 'ETEXT', 'ETEXT', 'ETEXT', 'ETEXT', 'TEX', 'TEXT', 'TEXT', 'TEXT', 'TEXT', 'TEXT']

If order does not matter, you can remove the duplicates with a set; otherwise, build a list of unique items that preserves the order:

nodupes = []   # unique substrings in first-seen order
k = set()      # substrings already seen
for l in parts:
    if l not in k:
        nodupes.append(l)
        k.add(l)

print(nodupes)

Output:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
 'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'ETE', 'ETEX', 'ETEXT', 'TEX', 'TEXT']
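
Note that range(len(t)-minl) stops one start index early, so the final substring 'EXT' is missing from the output above, and the duplicates come from slices that run past the end of the string. A minimal sketch that bounds both ranges avoids both effects without a deduplication pass (the name bounded_substrings is hypothetical and not part of the answer):

```python
def bounded_substrings(t, minl, maxl):
    """Return every substring of t with length in [minl, maxl], ordered by start index."""
    n = len(t)
    return [
        t[i:i + j]
        for i in range(n - minl + 1)                # last valid start index is n - minl
        for j in range(minl, min(maxl, n - i) + 1)  # never slice past the end of t
    ]


print(bounded_substrings("THISISASTRING", 3, 4))
```

For the question's example this reproduces the expected list. Unlike the set-based cleanup, it keeps a substring that genuinely occurs at two different positions, which may or may not be what is wanted.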

With timings:

def doit(t,minl,maxl):
    parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]
    return parts

pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = doit(random_english_text, 4, 10)

pr.disable()
pr.print_stats(sort='cumtime')

         3001 function calls in 0.597 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.001    0.000    0.597    0.001 main.py:10(doit)
     1000    0.596    0.001    0.596    0.001 main.py:11(<listcomp>)
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Your version gave: 4181001 function calls in 1.614 seconds
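
One caveat when comparing these two numbers: cProfile adds overhead to every function call it records, so a version that makes millions of small calls (presumably mostly list.append and min here) is penalized more than a single comprehension. The Python documentation suggests timeit for benchmarking. The following is only a sketch: loop_version, comprehension_version, and sample_text are illustrative names and an assumed stand-in input, not code from the question or the answer.

```python
import timeit


def loop_version(text, length_min, length_max):
    # Same idea as the question's loop: append one slice at a time.
    result = []
    for idx in range(len(text) - length_min + 1):
        max_depth = min(length_max, len(text) - idx)
        for size in range(length_min, max_depth + 1):
            result.append(text[idx:idx + size])
    return result


def comprehension_version(text, length_min, length_max):
    # Same substrings, built by one list comprehension with bounded ranges.
    n = len(text)
    return [text[i:i + j]
            for i in range(n - length_min + 1)
            for j in range(length_min, min(length_max, n - i) + 1)]


# Assumed stand-in input; replace with the real text to be benchmarked.
sample_text = "AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEIN" * 10

for fn in (loop_version, comprehension_version):
    seconds = timeit.timeit(lambda: fn(sample_text, 4, 10), number=200)
    print(f"{fn.__name__}: {seconds:.3f} s for 200 calls")
```

The absolute numbers depend on the machine and the Python version, so only the ratio between the two lines is meaningful.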

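Answer 1: (score: 0)

I am not sure how fast it is (read: it needs further investigation), but to me this sounds like a task for regular expressions (the re module). I would do it as follows: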

import re
minlen = 3
maxlen = 4
s = 'THISISASTRING'
out = []
for i in range(minlen,maxlen+1):
    p = re.compile('(?=(.{'+str(i)+'}))',re.DOTALL)
    out = out+p.findall(s)
print(out)

Output:

['THI', 'HIS', 'ISI', 'SIS', 'ISA', 'SAS', 'AST', 'STR', 'TRI', 'RIN', 'ING', 'THIS', 'HISI', 'ISIS', 'SISA', 'ISAS', 'SAST', 'ASTR', 'STRI', 'TRIN', 'RING']

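I used bernie's answer from another topic to make findall work in an overlapping manner. I know that this zero-length assertion (a lookahead) can hold a variable-length pattern, but when I run re.findall('(?=(.{3,4}))','THISISASTRING') it yields ['THIS', 'HISI', 'ISIS', 'SISA', 'ISAS', 'SAST', 'ASTR', 'STRI', 'TRIN', 'RING', 'ING'], which is not the desired output, because the greedy quantifier captures only the longest match at each position. So I came up with a hybrid for-re solution that makes one pass per substring length. I must admit that my re skills are not good enough to make this work in a single pass (re only, without for), but maybe some other user can do it?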
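
One possible way to get a single pass over the string is to chain one lookahead per length, each with an empty alternative so that it still succeeds when too few characters remain. This is a sketch under that assumption, not the answerer's method, and nothing here says it is faster; regex_substrings is a hypothetical name.

```python
import re


def regex_substrings(s, minlen, maxlen):
    # Build e.g. '(?=(.{3})|)(?=(.{4})|)': one capturing lookahead per length.
    # The empty alternative keeps the overall match alive near the end of s;
    # a group that could not match is reported as None.
    pattern = re.compile(
        "".join("(?=(.{%d})|)" % n for n in range(minlen, maxlen + 1)),
        re.DOTALL,
    )
    # finditer yields one zero-width match per position in s.
    return [g for m in pattern.finditer(s) for g in m.groups() if g is not None]


print(regex_substrings("THISISASTRING", 3, 4))
```

The result is ordered by start position and then by length, which matches the order requested in the question.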

Answer 2: (score: 0)

You can map ''.join over the zipped, shifted copies of the string:

def func(s, min_l, max_l):
    return [subl for i in range(min_l, max_l + 1)
                 for subl in map(''.join, zip(*[s[i:] for i in range(i)]))]
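
To see why this works, it may help to look at a single length in isolation. The sketch below uses a short made-up string and spells out the intermediate values that func builds for one value of i (the variable names are illustrative only):

```python
s = "ABCDE"
n = 3  # substring length handled in this pass

# n copies of s, each shifted one character further to the left.
shifted = [s[k:] for k in range(n)]
print(shifted)      # ['ABCDE', 'BCDE', 'CDE']

# zip reads the copies column by column and stops at the shortest one,
# so each column holds n consecutive characters of s.
columns = list(zip(*shifted))
print(columns)      # [('A', 'B', 'C'), ('B', 'C', 'D'), ('C', 'D', 'E')]

# Joining each column gives the substrings of length n.
substrings = ["".join(col) for col in columns]
print(substrings)   # ['ABC', 'BCD', 'CDE']
```

Because func repeats this once per length, its result is grouped by length rather than by start position, so the order differs from the list shown in the question.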

Profiling:

random_english_text = \
    'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
    'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
    'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
    'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
    'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'

pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = func(random_english_text, 4, 10)

pr.disable()
pr.print_stats(sort='cumtime')

Output:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000    0.002    0.000    0.772    0.001 Untitled.py:3(func)
  7000    0.014    0.000    0.014    0.000 Untitled.py:4(<listcomp>)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}