Question

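I have a string that is the concatenation of three components:

a word from list 1 (including the empty string)
a word from list 2
a word from list 3 (including the empty string)

Lists 1, 2 and 3 can each contain up to 5000 elements. An element of one list never appears in another list (except for the empty string). However, some words can be part of other words.

I am looking for an efficient way to find the three components. Right now I am doing the following: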

for word in list2:
    if word in long_word:
        try:
            [bef, aft] = long_word.split(word)
        except ValueError:  # too many values to unpack
            continue
        if bef in list1 and aft in list3:
            print('Found: {}, {}, {}'.format(bef, word, aft))
            break
else:
    print('Not found')

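I am wondering whether there is a better way. I thought about using a regular expression with pipes (alternation), but the number of alternatives seems to be too large: OverflowError: regular expression code size limit exceeded.

Thanks,

Update

I tried modified versions of the proposed solutions: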

def fj(long_word, list1, list2, list3):
    for x in filter(long_word.startswith, list1):
        for y in filter(long_word[len(x):].startswith, list2):
            z = long_word[len(x)+len(y):]
            if z in list3:
                yield x, y, z

def sid(long_word, list1, list2, list3):
    for w1 in list1:
        if not long_word.startswith(w1):
            continue
        cut1 = long_word[len(w1):]
        for w2 in list2:
            if not cut1.startswith(w2):
                continue
            cut2 = cut1[len(w2):]
            for w3 in list3:
                if cut2 == w3:
                    yield w1, w2, w3

def my(long_word, list1, list2, list3):
    for word in list2:
        if word in long_word:
            try:
                [bef, aft] = long_word.split(word)
            except ValueError:  # too many values to unpack
                continue
            if bef in list1 and aft in list3:
                yield bef, word, aft

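Here are the (normalized) timing results I got using lists of 8000 elements, repeated 10000 times, each time picking one random word from each list to build long_word:

my: 1.0
sid: 4.5
fj: 2.7

I am really surprised, because I expected fj's approach to be the fastest.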
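
Note that the three functions are not strictly equivalent: my skips any list2 word that occurs more than once in long_word, because split then returns more than two pieces. The sketch below is one way to keep the "start from the middle word" idea while checking every occurrence. The name find_by_middle, the sample words, and the conversion of list1 and list3 to sets are assumptions for illustration, not part of the original code.

```python
def find_by_middle(long_word, list1, list2, list3):
    """Yield (bef, word, aft) for every occurrence of a list2 word."""
    # Assumption: building the sets once is acceptable; membership
    # tests then become hash lookups instead of list scans.
    set1, set3 = set(list1), set(list3)
    for word in list2:
        start = long_word.find(word)
        while start != -1:
            bef = long_word[:start]
            aft = long_word[start + len(word):]
            if bef in set1 and aft in set3:
                yield bef, word, aft
            # Look for the next (possibly overlapping) occurrence.
            start = long_word.find(word, start + 1)


# "an" occurs twice in "banana", so a single split() would give up here.
print(list(find_by_middle("banana", ["", "b"], ["an"], ["", "ana"])))
# [('b', 'an', 'ana')]
```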

Answer 1

Regular expressions are probably not a good fit here. I would do something like this:

for x in filter(long_word.startswith, list1):
    for y in filter(long_word[len(x):].startswith, list2):
        z = long_word[len(x)+len(y):]
        if z in list3:
            print('Found: {}, {}, {}'.format(x, y, z))
            break
    else:
        continue
    break
else:
    print('Not found')
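
For reuse, the same prefix-filtering idea can be wrapped in a generator. This is a minimal sketch rather than a tuned implementation: the name find_parts, the sample word lists, and the conversion of list3 to a set (so the final membership test is a hash lookup instead of a list scan) are assumptions added for illustration.

```python
def find_parts(long_word, list1, list2, list3):
    """Yield every (x, y, z) with x + y + z == long_word."""
    set3 = set(list3)  # assumption: hashing list3 once is acceptable
    for x in filter(long_word.startswith, list1):
        rest = long_word[len(x):]
        for y in filter(rest.startswith, list2):
            z = rest[len(y):]
            if z in set3:
                yield x, y, z


# Hypothetical sample data, not taken from the question.
list1 = ["", "un", "under"]
list2 = ["do", "stand"]
list3 = ["", "ing", "able"]

# next() with a default stops at the first hit, or returns None.
print(next(find_parts("understanding", list1, list2, list3), None))
# ('under', 'stand', 'ing')
```

Using next() with a default replaces the for/else/break pattern when only the first match is needed.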

Answer 2

A naive algorithm would be to run three nested loops:

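import re

for w1 in list1:
    # re.escape keeps list words from being read as regex syntax
    if re.match(re.escape(w1), s) is None:
        continue
    rest1 = s[len(w1):]
    for w2 in list2:
        if re.match(re.escape(w2), rest1) is None:
            continue
        rest2 = rest1[len(w2):]
        for w3 in list3:
            # fullmatch: w3 has to consume the whole remainder
            if re.fullmatch(re.escape(w3), rest2):
                print('Found: {}, {}, {}'.format(w1, w2, w3))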

I think you are still stuck with the case where a word from list1 is a substring of a word from list2. F.J's approach is probably better.

Answer 3

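My answer does not fully answer your question, but it is a reminder of what we are actually dealing with here.

Lists 1, 2 and 3 can each contain up to 5000 elements.

This means lists 1, 2 and 3 are finite (and therefore regular) languages. From now on I will denote list 1 as A, list 2 as B and list 3 as C.

a word from list 1 (including the empty string)

a word from list 2

a word from list 3 (including the empty string)

Therefore the empty string (lambda) is in both A and C.

You have a string w that can be written in the form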

w = abc

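where a is a string in A, b is a string in B and c is a string in C.

What you are trying to do is split w into the substrings a, b and c.

Since a can be empty and c can be empty, you have the following possibilities: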

1. w = abc
2. w = ab
3. w = bc
4. w = b

For starters, let's eliminate the trivial scenario #4.

if w in B:
  a = ""
  b = w
  c = ""
  print('Found: {}, {}, {}'.format(a, b, c))
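
The remaining cases can be handled uniformly by trying every pair of split points in w and testing membership. The following is only a sketch, assuming A, B and C can be turned into sets; the name split_word and the sample data are hypothetical. Because a and c may be empty, the cases w = ab, w = bc and w = b fall out of the same loop.

```python
def split_word(w, A, B, C):
    """Return every (a, b, c) with a in A, b in B, c in C and a + b + c == w."""
    A, B, C = set(A), set(B), set(C)  # assumption: sets for O(1) lookups
    n = len(w)
    results = []
    for i in range(n + 1):           # a = w[:i], empty when i == 0
        if w[:i] not in A:
            continue
        for j in range(i, n + 1):    # b = w[i:j], c = w[j:], empty when j == n
            if w[i:j] in B and w[j:] in C:
                results.append((w[:i], w[i:j], w[j:]))
    return results


# Hypothetical sample data with two valid decompositions.
print(split_word("abcd", ["", "ab"], ["c", "abc"], ["", "d"]))
# [('', 'abc', 'd'), ('ab', 'c', 'd')]
```

The number of membership tests depends on the length of w (roughly its square), not on the sizes of the lists.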

More to come as I think of it.
