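I want to check a set of sentences to see whether any of a list of seed words appears in them. But I want to avoid using `for seed in line`, because that would report that the seed word `ring` appears in a document containing the word `bring`.
I also want to check whether multi-word expressions (MWEs) such as `word with spaces` appear in the documents.
I have tried the following, but it is very slow. Is there a faster way?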
seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]
docs_seed = []
for d in docs:
    toAdd = False
    for s in seed:
        # Multi-word seeds are tested as plain substrings of the document.
        if " " in s:
            if s in d:
                toAdd = True
        # Single-word seeds must equal one of the space-separated tokens.
        if s in d.split(" "):
            toAdd = True
        if toAdd:
            docs_seed.append((s, d))
            break
print(docs_seed)
The desired output should be:
[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]
Answer 0: (score: 3)
Consider using a regular expression:
import re
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)
`\b` matches at the start or end of a "word" (a sequence of word characters: letters, digits, and underscores).
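As a small illustration of what the boundary buys you, the sketch below compares a bare pattern with a `\b`-delimited one on the `ring` / `bring` case from the question. The names `plain` and `bounded` are only illustrative.

```python
import re

# Without \b the seed is found inside a longer word; with \b it is not.
plain = re.compile(r'ring')
bounded = re.compile(r'\bring\b')

sentence = 'i forgot to bring my telephone'

print(plain.findall(sentence))               # ['ring']  (matched inside "bring")
print(bounded.findall(sentence))             # []
print(bounded.findall('the ring is here'))   # ['ring']
```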
Example:
>>> for line in docs:
...     print(pattern.findall(line))
...
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
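To get the `(seed, doc)` pairs the question asks for, one option is to call `pattern.search` once per document and keep only the documents that match. This is a minimal sketch, not the only way to do it. It assumes the `seed` and `docs` lists from the question, and it reports the leftmost match in each document, which happens to coincide with the desired output for this data. The question's loop instead reports the first seed in list order, so the two can differ on other inputs.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = [
    'these are words with spaces but the drinks are the bar is also good',
    'another sentence at the foo bar is here',
    'then a bar bar black sheep',
    'but i dont want this sentence because there is just nothing that matches my list',
    "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too",
]

# One compiled alternation over all seeds, delimited by word boundaries.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')

docs_seed = []
for d in docs:
    m = pattern.search(d)  # leftmost match in the document, or None
    if m:
        docs_seed.append((m.group(0), d))

print(docs_seed)
```

Because alternatives are tried left to right at each position, the order of `seed` decides which seed is reported when several start at the same place (`'bar'` rather than `'bar bar'` here).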
Answer 1: (score: 0)
This should be more efficient and faster than your current approach:
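docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        end = pos + len(s)
        # Accept the match only if it is bounded on both sides
        # by a space or by the start/end of the string.
        if (pos != -1
                and (pos == 0 or d[pos - 1] == " ")
                and (end == len(d) or d[end] == " ")):
            docs_seed.append((s, d))
            break
`find` gives us the position of the `seed` value in the doc (or -1 if it is not found); we then check that the characters before and after the value are spaces (or that the value sits at the very start or end of the string). Note that only the first occurrence of each seed is examined. This also fixes a bug in the original code, where multi-word expressions did not need to start or end on a word boundary: your original code matches a seed like "words with spaces" against input like "swords with spaces".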
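Since `find` returns only the first position, a document such as "bring the ring" would be rejected for the seed `ring` even though the word appears later on. A possible refinement, sketched here with a hypothetical helper named `contains_seed`, is to keep calling `find` with a start offset until a properly delimited occurrence turns up. Like the answer above, it treats only spaces and string edges as delimiters.

```python
def contains_seed(doc, s):
    """Return True if s occurs in doc bounded by spaces or string edges."""
    pos = doc.find(s)
    while pos != -1:
        end = pos + len(s)
        left_ok = pos == 0 or doc[pos - 1] == " "
        right_ok = end == len(doc) or doc[end] == " "
        if left_ok and right_ok:
            return True
        pos = doc.find(s, pos + 1)  # move on to the next occurrence
    return False


print(contains_seed('swords with spaces', 'words with spaces'))  # False
print(contains_seed('swords with spaces and words with spaces',
                    'words with spaces'))                        # True
print(contains_seed('bring the ring', 'ring'))                   # True
```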