Question

First, I have a large list of words:

words = ['about', 'black', 'red', ...]  # 20,000+ entries

Then, given a string such as:

s = 'blackingabouthahah'

I want to get ['black', 'about'].

I tried doing this with a regular expression:

import re
pattern = re.compile('|'.join(words))
print(pattern.findall(s))

It works, but I am worried about the speed and memory usage of this approach.

Is there a better solution?
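One caveat about the findall attempt above, shown as a small sketch: findall returns non-overlapping matches, so a word that shares characters with an earlier match is skipped. The entry 'king' below is a hypothetical addition to the word list, used only to show the effect, and the lookahead variant is one possible workaround rather than a recommendation.

```python
import re

words = ['about', 'black', 'king']   # 'king' is a hypothetical extra entry
s = 'blackingabouthahah'

# Escape each word so it is matched literally, not as regex syntax.
alternation = '|'.join(re.escape(w) for w in words)

# Plain alternation: matches cannot overlap, so 'king' is lost because
# its 'k' was already consumed by the match for 'black'.
plain = re.findall(alternation, s)

# Zero-width lookahead with a capturing group: nothing is consumed,
# so a word starting inside an earlier match is still reported.
overlapping = re.findall('(?=(' + alternation + '))', s)

print(plain)        # ['black', 'about']
print(overlapping)  # ['black', 'king', 'about']
```

Note that the lookahead still reports at most one word per start position (the first alternative that matches), so two list entries beginning at the same index would not both show up.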

Answer 1

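You can take a non-regex approach by using str.find in a list comprehension:

words = ['about', 'black', 'red']
s = 'blackingabouthahah'
# s.find(x) returns -1 when x does not occur in s
print([x for x in words if s.find(x) > -1])

See the IDEONE demo.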
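The comprehension returns matches in the order of the word list, which gives ['about', 'black'] here rather than the ['black', 'about'] asked for in the question. If the order of appearance in the string matters, a minimal sketch (the name found is just illustrative) is to sort the matches by their first index:

```python
words = ['about', 'black', 'red']
s = 'blackingabouthahah'

# Keep only the words that occur in s, then order them by the index of
# their first occurrence (s.find returns that index).
found = sorted((w for w in words if w in s), key=s.find)
print(found)  # ['black', 'about']
```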

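This outputs each matching word only once, no matter how many times it occurs in the string. If you need to count all occurrences:

words = ['about', 'black', 'red']
s = 'blackingabouthahahabout'
# One count per word, in the same order as the words list
print([s.count(x) for x in words])

I return counts here because I see no difference between the first about and the second about. See another demo.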
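A bare list of counts is hard to read once the word list gets long, since each number has to be matched back to its word by position. As a small variation (the name counts is only illustrative), a dict comprehension keeps each word next to its count and drops the words that never occur:

```python
words = ['about', 'black', 'red']
s = 'blackingabouthahahabout'

# Map each word to its number of non-overlapping occurrences,
# skipping words that do not appear in s at all.
counts = {w: s.count(w) for w in words if w in s}
print(counts)  # {'about': 2, 'black': 1}
```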

Answer 2

If you just want to print the matches, here is a solution:

   import re

   words = ['about', 'black', 'red']
   s = 'dsjhdgblackingabouthahah'

   for word in words:
       # re.escape makes the word match literally
       if re.search(re.escape(word), s):
           print(word)

If you want the results in a new list, you can try:

   import re

   words = ['about', 'black', 'red']
   s = 'dsjhdgblackingabouthahah'
   mylist = []
   for word in words:
       # Collect every word that occurs somewhere in s
       if re.search(re.escape(word), s):
           mylist.append(word)

   print(mylist)
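Both loops above run one search per word, so the work grows with the size of the word list (20,000+ in the question). As a rough alternative sketch, assuming the input string is short compared with the list, you can put the words in a set and test the substrings of s instead. The function name extract_words is hypothetical, and whether this is actually faster for your data is something to measure.

```python
def extract_words(s, words):
    """Return the words found in s, ordered by first appearance."""
    word_set = set(words)                          # fast membership tests
    # Only slice lengths that some word actually has are worth trying.
    lengths = sorted({len(w) for w in word_set if w})
    found, seen = [], set()
    for start in range(len(s)):
        for n in lengths:
            if start + n > len(s):
                break                              # longer slices cannot fit
            piece = s[start:start + n]
            if piece in word_set and piece not in seen:
                seen.add(piece)
                found.append(piece)
    return found


words = ['about', 'black', 'red']
s = 'dsjhdgblackingabouthahah'
print(extract_words(s, words))  # ['black', 'about']
```

The cost here depends on the length of s and the number of distinct word lengths, not on how many words are in the list.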

Python: extracting words from a string based on a large word list
