Question

I have a list of words:

[
    'investment',
    'property',
    'something',
    'else',
    'vest'
]

I also have a list of strings, like this:

[
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
    'abcinvestmentproperty',
    'investmentabcproperty'
]

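Given this word list and string list, I need to determine which strings consist only of words from the word list, using no more than a given maximum number of words.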

In the example above, if the maximum number of words is 3, only the first two items in the string list match (even though 'vest' occurs inside 'investment').

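This example uses a simplified word list and string list. In reality there are thousands of words and hundreds of thousands of strings, so this needs to be efficient. None of the strings contain spaces.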

I tried building a regular expression like this:

^(?:(word1)|(word2)|(word3)){1,3}$

But with the number of words in the word list (currently 10,000), this is slow.

Thanks.

Answer 1

I think this is what you expect.

Code:

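strings = [
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
    'abcinvestmentproperty',
    'investmentabcproperty'
]
words = [
    'investment',
    'property',
    'something',
    'else'
]
# Keep only words that are not contained in another word of the list.
# Built as a list (not a lazy filter object) so it can be reused per string.
new_words = [w for w in words if not any(w in other and w != other for other in words)]
res = []
for string in strings:
    # Collect every remaining word that occurs somewhere in the string.
    in_words = [w for w in new_words if w in string]
    # Accept the string if the found words add up to its full length.
    # Note: this compares lengths only; it does not check word order,
    # repeated words, or a maximum number of words.
    if len(''.join(in_words)) == len(string):
        res.append(string)
print(res)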

Output:

['investmentproperty', 'investmentsomethingproperty', 'investmentsomethingelseproperty']
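
A length comparison like the one above cannot enforce the maximum word count, and it also rejects strings that repeat a word. As a hedged sketch, assuming plain Python 3 and no third-party libraries, the requirement could be checked with a small dynamic-programming pass that computes the fewest words needed to build each string. The names min_words_needed and matching_strings are illustrative and not part of the original answer.

```python
def min_words_needed(s, word_set, max_len):
    """Return the fewest words from word_set that concatenate to s, or None."""
    inf = float("inf")
    # best[i] = fewest words needed to build the prefix s[:i]
    best = [0] + [inf] * len(s)
    for end in range(1, len(s) + 1):
        # Only look back as far as the longest word in the list.
        for start in range(max(0, end - max_len), end):
            if best[start] + 1 < best[end] and s[start:end] in word_set:
                best[end] = best[start] + 1
    return None if best[-1] == inf else best[-1]


def matching_strings(strings, words, max_words):
    """Keep the strings that can be built from at most max_words words."""
    word_set = set(words)
    max_len = max(map(len, word_set), default=0)
    result = []
    for s in strings:
        count = min_words_needed(s, word_set, max_len)
        if count is not None and count <= max_words:
            result.append(s)
    return result


words = ['investment', 'property', 'something', 'else', 'vest']
strings = [
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
    'abcinvestmentproperty',
    'investmentabcproperty',
]
print(matching_strings(strings, words, 3))
```

A string qualifies exactly when its minimum word count is within the limit, so the presence of 'vest' in the list does not cause false matches. The work per string is bounded by its length times the longest word length, independent of how many words are in the list.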

Answer 2

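For what my two cents are worth, I would convert the list of keywords you are looking for into a dictionary, so that you don't have to keep iterating over both lists.

Also, take a look at the Aho-Corasick algorithm, which might be very helpful to you.

If that is not of interest, try one of the following:

1- If you want to keep both lists (my_list here is your list of strings):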

matchers = ['investment', 'property', 'something', 'else', 'vest']
matching = [s for s in my_list if any(xs in s for xs in matchers)]

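2- Or, to count how many items of list2 are equal to an item of list1:

from functools import reduce  # in Python 3, reduce lives in functools
reduce((lambda x, y: x + len(list(filter((lambda z, x=y: z == x), list2)))), list1, 0)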

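3- This also sounds very interesting and looks like a good alternative to regular expressions.

As for the additional requirement of limiting the number of matches, perhaps you could add a while loop that breaks once the match count is reached. If the dictionary has the words you are looking for as keys and all values set to 1, you would add up the values each time a word is found until the target is reached.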
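
To make the dictionary idea and the early stop concrete, here is a minimal sketch that stores the words in a trie built from nested dictionaries and gives up as soon as the word limit is used up. This is a plain trie, a simpler relative of Aho-Corasick rather than Aho-Corasick itself, and the names END, build_trie and is_match are hypothetical helpers introduced for illustration.

```python
END = "$end"  # marker key; it cannot clash with single-character keys


def build_trie(words):
    """Store the words character by character in nested dictionaries."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def is_match(s, trie, max_words):
    """True if s is a concatenation of 1 to max_words words from the trie."""
    frontier = {0}  # positions in s reachable with the words used so far
    for _ in range(max_words):
        next_frontier = set()
        for start in frontier:
            node = trie
            for pos in range(start, len(s)):
                node = node.get(s[pos])
                if node is None:
                    break  # no word continues with this character
                if END in node:
                    next_frontier.add(pos + 1)  # a word ends here
        if len(s) in next_frontier:
            return True
        if not next_frontier:
            return False  # dead end before the limit was reached
        frontier = next_frontier
    return False  # word limit reached without covering the string


trie = build_trie(['investment', 'property', 'something', 'else', 'vest'])
print(is_match('investmentsomethingproperty', trie, 3))
print(is_match('investmentsomethingelseproperty', trie, 3))
```

Each round of the outer loop spends one more word, so the loop itself plays the role of the counter that stops once the target number of matches is reached.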

Answer 3

How fast do you expect it to be? I tested the following code:

import re

_list = ['investmentproperty'] * 100000
_dict = [
    'investment',
    'property',
    'something',
    'else'
] * 1000
regex = re.compile("^(?:" + "|".join(_dict) + "){1,3}$")

for i in _list:
    result = regex.match(i)
#cost 5.06s

for i in _list:
    result = re.match("^(?:" + "|".join(_dict) + "){1,3}$", i)
#cost 11.04s

With a list of 100,000 strings and a dictionary of 4,000 entries, I think that is not bad performance, right?
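
For reference, a slightly more defensive way to build the same kind of pattern is sketched below. It assumes the word list may contain duplicates or regex metacharacters, so it deduplicates and escapes the words, compiles the pattern once, and uses fullmatch instead of explicit anchors. build_pattern is a hypothetical helper name, and no timing claims are made for this variant.

```python
import re


def build_pattern(words, max_words):
    """Compile a pattern matching 1 to max_words words from the list."""
    # Deduplicate, and put longer words first so they are tried
    # before shorter words such as 'vest'.
    unique = sorted(set(words), key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in unique)
    return re.compile("(?:%s){1,%d}" % (alternation, max_words))


words = ['investment', 'property', 'something', 'else', 'vest']
strings = [
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
    'abcinvestmentproperty',
    'investmentabcproperty',
]

pattern = build_pattern(words, 3)  # compile once, reuse for every string
matches = [s for s in strings if pattern.fullmatch(s)]
print(matches)
```

Compiling once outside the loop matters for the same reason the first timing above beats the second: calling re.match with a pattern string means the pattern has to be looked up or rebuilt on every call.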

Match a string without spaces if it contains words from a word list
