给定列表["one", "two", "three"]
,如何确定每个单词是否存在于指定的字符串中?
单词列表很短(在我的情况下少于20个单词),但要搜索的字符串非常庞大(每次运行400,000个字符串)
我目前的实现使用re
来查找匹配项,但我不确定它是否是最好的方式。
import re
word_list = ["one", "two", "three"]
regex_string = "(?<=\W)(%s)(?=\W)" % "|".join(word_list)
finder = re.compile(regex_string)
string_to_be_searched = "one two three"
results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
if word in result_set:
print("%s in string" % word)
我的解决方案中的问题:
可能更简单的实施:
if word in string_to_be_searched
。但如果你正在寻找“三”,它就无法处理“三人”。更新
我接受了Aaron Hall的回答https://stackoverflow.com/a/21718896/683321,因为根据Peter Gibson的基准https://stackoverflow.com/a/21742190/683321,这个简单的版本具有最佳性能。如果您对此问题感兴趣,可以阅读所有答案并获得更好的视图。
实际上我忘记在原来的问题中提到另一个约束。该单词可以是短语,例如:word_list = ["one day", "second day"]
。也许我应该问另一个问题。
答案 0 :(得分:11)
Peter Gibson(下面)发现这个功能是这里最有效的答案。它可以保存在内存中的数据集(因为它会从要搜索的字符串中创建一个单词列表,然后是一组这些单词):
def words_in_string(word_list, a_string):
return set(word_list).intersection(a_string.split())
用法:
my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
print('One or more words found!')
将One or words found!
打印到stdout。
它 返回找到的实际单词:
for word in words_in_string(my_word_list, a_string):
print(word)
打印出来:
three
two
one
答案 1 :(得分:5)
为了满足自己的好奇心,我定时发布了解决方案。结果如下:
TESTING: words_in_str_peter_gibson 0.207071995735
TESTING: words_in_str_devnull 0.55300579071
TESTING: words_in_str_perreal 0.159866499901
TESTING: words_in_str_mie Test #1 invalid result: None
TESTING: words_in_str_adsmith 0.11831510067
TESTING: words_in_str_gnibbler 0.175446796417
TESTING: words_in_string_aaron_hall 0.0834425926208
TESTING: words_in_string_aaron_hall2 0.0266295194626
TESTING: words_in_str_john_pirie <does not complete>
有趣的是@AaronHall的解决方案
def words_in_string(word_list, a_string):
return set(a_list).intersection(a_string.split())
这是最快的,也是最短的之一!请注意,它不会处理单词旁边的标点符号,但从问题是否是必需项中不清楚。 @MIE和@ user3也提出了这个解决方案。
为什么两个解决方案无效,我看起来并不长。如果这是我的错误,请道歉。这是测试,评论和代码的代码。欢迎更正
from __future__ import print_function
import re
import string
import random
words = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
def random_words(length):
letters = ''.join(set(string.ascii_lowercase) - set(''.join(words))) + ' '
return ''.join(random.choice(letters) for i in range(int(length)))
LENGTH = 400000
RANDOM_STR = random_words(LENGTH/100) * 100
TESTS = (
(RANDOM_STR + ' one two three', (
['one', 'two', 'three'],
set(['one', 'two', 'three']),
False,
[True] * 3 + [False] * 7,
{'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
'seven': False, 'eight': False, 'nine': False, 'ten':False}
)),
(RANDOM_STR + ' one two three four five six seven eight nine ten', (
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'],
set(['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']),
True,
[True] * 10,
{'one': True, 'two': True, 'three': True, 'four': True, 'five': True, 'six': True,
'seven': True, 'eight': True, 'nine': True, 'ten':True}
)),
('one two three ' + RANDOM_STR, (
['one', 'two', 'three'],
set(['one', 'two', 'three']),
False,
[True] * 3 + [False] * 7,
{'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
'seven': False, 'eight': False, 'nine': False, 'ten':False}
)),
(RANDOM_STR, (
[],
set(),
False,
[False] * 10,
{'one': False, 'two': False, 'three': False, 'four': False, 'five': False, 'six': False,
'seven': False, 'eight': False, 'nine': False, 'ten':False}
)),
(RANDOM_STR + ' one two three ' + RANDOM_STR, (
['one', 'two', 'three'],
set(['one', 'two', 'three']),
False,
[True] * 3 + [False] * 7,
{'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
'seven': False, 'eight': False, 'nine': False, 'ten':False}
)),
('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' three', (
['one', 'two', 'three'],
set(['one', 'two', 'three']),
False,
[True] * 3 + [False] * 7,
{'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
'seven': False, 'eight': False, 'nine': False, 'ten':False}
)),
('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' threesome', (
['one', 'two'],
set(['one', 'two']),
False,
[True] * 2 + [False] * 8,
{'one': True, 'two': True, 'three': False, 'four': False, 'five': False, 'six': False,
'seven': False, 'eight': False, 'nine': False, 'ten':False}
)),
)
def words_in_str_peter_gibson(words, s):
words = words[:]
found = []
for match in re.finditer('\w+', s):
word = match.group()
if word in words:
found.append(word)
words.remove(word)
if len(words) == 0: break
return found
def words_in_str_devnull(word_list, inp_str1):
return dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str1))) for word in word_list)
def words_in_str_perreal(wl, s):
i, swl, strwords = 0, sorted(wl), sorted(s.split())
for w in swl:
while strwords[i] < w:
i += 1
if i >= len(strwords): return False
if w != strwords[i]: return False
return True
def words_in_str_mie(search_list, string):
lower_string=string.lower()
if ' ' in lower_string:
result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
substr=lower_string[:lower_string.find(' ')]
if substr in search_list and substr not in result:
result+=substr
substr=lower_string[lower_string.rfind(' ')+1:]
if substr in search_list and substr not in result:
result+=substr
else:
if lower_string in search_list:
result=[lower_string]
def words_in_str_john_pirie(word_list, to_be_searched):
for word in word_list:
found = False
while not found:
offset = 0
# Regex is expensive; use find
index = to_be_searched.find(word, offset)
if index < 0:
# Not found
break
if index > 0 and to_be_searched[index - 1] != " ":
# Found, but substring of a larger word; search rest of string beyond
offset = index + len(word)
continue
if index + len(word) < len(to_be_searched) \
and to_be_searched[index + len(word)] != " ":
# Found, but substring of larger word; search rest of string beyond
offset = index + len(word)
continue
# Found exact word match
found = True
return found
def words_in_str_gnibbler(words, string_to_be_searched):
word_set = set(words)
found = []
for match in re.finditer(r"\w+", string_to_be_searched):
w = match.group()
if w in word_set:
word_set.remove(w)
found.append(w)
return found
def words_in_str_adsmith(search_list, big_long_string):
counter = 0
for word in big_long_string.split(" "):
if word in search_list: counter += 1
if counter == len(search_list): return True
return False
def words_in_string_aaron_hall(word_list, a_string):
def words_in_string(word_list, a_string):
'''return iterator of words in string as they are found'''
word_set = set(word_list)
pattern = r'\b({0})\b'.format('|'.join(word_list))
for found_word in re.finditer(pattern, a_string):
word = found_word.group(0)
if word in word_set:
word_set.discard(word)
yield word
if not word_set:
raise StopIteration
return list(words_in_string(word_list, a_string))
def words_in_string_aaron_hall2(word_list, a_string):
return set(word_list).intersection(a_string.split())
ALGORITHMS = (
words_in_str_peter_gibson,
words_in_str_devnull,
words_in_str_perreal,
words_in_str_mie,
words_in_str_adsmith,
words_in_str_gnibbler,
words_in_string_aaron_hall,
words_in_string_aaron_hall2,
words_in_str_john_pirie,
)
def test(alg):
for i, (s, possible_results) in enumerate(TESTS):
result = alg(words, s)
assert result in possible_results, \
'Test #%d invalid result: %s ' % (i+1, repr(result))
COUNT = 10
if __name__ == '__main__':
import timeit
for alg in ALGORITHMS:
print('TESTING:', alg.__name__, end='\t\t')
try:
print(timeit.timeit(lambda: test(alg), number=COUNT)/COUNT)
except Exception as e:
print(e)
答案 2 :(得分:1)
def words_in_str(s, wl):
i, swl, strwords = 0, sorted(wl), sorted(s.split())
for w in swl:
while strwords[i] < w:
i += 1
if i >= len(strwords): return False
if w != strwords[i]: return False
return True
答案 3 :(得分:1)
你可以试试这个:
list(set(s.split()).intersection(set(w)))
它只返回单词列表中匹配的单词。如果没有匹配的单词,它将返回空列表。
答案 4 :(得分:0)
如果您的字符串很长且搜索列表很短,请执行以下操作:
def search_string(big_long_string,search_list)
counter = 0
for word in big_long_string.split(" "):
if word in search_list: counter += 1
if counter == len(search_list): return True
return False
答案 5 :(得分:0)
简单方法:
filter(lambda x:x in string,search_list)
如果你想让搜索忽略字符的情况,你可以这样做:
lower_string=string.lower()
filter(lambda x:x.lower() in lower_string,search_list)
如果你想忽略大字的一部分,比如三人组中的三个:
lower_string=string.lower()
result=[]
if ' ' in lower_string:
result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
substr=lower_string[:lower_string.find(' ')]
if substr in search_list and substr not in result:
result+=[substr]
substr=lower_string[lower_string.rfind(' ')+1:]
if substr in search_list and substr not in result:
result+=[substr]
else:
if lower_string in search_list:
result=[lower_string]
<小时/> 如果需要性能:
arr=string.split(' ')
result=list(set(arr).intersection(set(search_list)))
编辑:这个方法在一个示例中最快,在一个包含400,000个单词的字符串中搜索1,000个单词,但如果我们将字符串增加到4,000,000,则前一个方法更快。
<小时/> 如果字符串太长,您应该进行低级搜索并避免将其转换为列表:
def safe_remove(arr,elem):
try:
arr.remove(elem)
except:
pass
not_found=search_list[:]
i=string.find(' ')
j=string.find(' ',i+1)
safe_remove(not_found,string[:i])
while j!=-1:
safe_remove(not_found,string[i+1:j])
i,j=j,string.find(' ',j+1)
safe_remove(not_found,string[i+1:])
not_found
列表包含未找到的字词,您可以轻松获取找到的列表,单向list(set(search_list)-set(not_found))
编辑:最后一种方法似乎是最慢的。
答案 6 :(得分:0)
您可以使用字边界:
>>> import re
>>> word_list = ["one", "two", "three"]
>>> inp_str = "This line not only contains one and two, but also three"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
... print "Found all words in the list"
...
Found all words in the list
>>> inp_str = "This line not only contains one and two, but also threesome"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
... print "Found all words in the list"
...
>>> inp_str = "This line not only contains one and two, but also four"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
... print "Found all words in the list"
...
>>>
编辑:正如您的评论所示,您似乎正在寻找字典:
>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str1))) for word in word_list)
{'three': True, 'two': True, 'one': True}
>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str2))) for word in word_list)
{'three': False, 'two': True, 'one': True}
>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str3))) for word in word_list)
{'three': False, 'two': True, 'one': True}
答案 7 :(得分:0)
如果订单不太重要,您可以使用此方法
word_set = {"one", "two", "three"}
string_to_be_searched = "one two three"
for w in string_to_be_searched.split():
if w in word_set:
print("%s in string" % w)
word_set.remove(w)
.split()
会创建一个列表,可能成为您的400k字符串的问题。但如果你有足够的内存,你就完成了。
当然可以修改for循环以避免创建整个列表。 re.finditer
或使用str.find
的生成器是显而易见的选择
import re
word_set = {"one", "two", "three"}
string_to_be_searched = "one two three"
for match in re.finditer(r"\w+", string_to_be_searched):
w = match.group()
if w in word_set:
print("%s in string" % w)
word_set.remove(w)
答案 8 :(得分:0)
鉴于你的评论
我实际上并没有找到一个bool值,而是我正在寻找 用于将字词映射到bool的字典。此外,我可能需要进行一些测试 并查看多次运行re.search并运行的性能 re.findall一次。 - yegle
我建议以下
import re
words = ['one', 'two', 'three']
def words_in_str(words, s):
words = words[:]
found = []
for match in re.finditer('\w+', s):
word = match.group()
if word in words:
found.append(word)
words.remove(word)
if len(words) == 0: break
return found
assert words_in_str(words, 'three two one') == ['three', 'two', 'one']
assert words_in_str(words, 'one two. threesome') == ['one', 'two']
assert words_in_str(words, 'nothing of interest here one1') == []
这将返回按顺序找到的单词列表,但您可以轻松修改它以根据需要返回dict{word:bool}
。
优点:
答案 9 :(得分:0)
这是一个简单的生成器,对于大字符串或文件更好,因为我在下面的部分中进行了修改。
请注意,这应该非常快,但只要字符串继续而不击中所有单词,它将继续。在Peter Gibson的基准测试中排在第二位:Python: how to determine if a list of words exist in a string
要获得更快的短字符串解决方案,请参阅我的其他答案:Python: how to determine if a list of words exist in a string
import re
def words_in_string(word_list, a_string):
'''return iterator of words in string as they are found'''
word_set = set(word_list)
pattern = r'\b({0})\b'.format('|'.join(word_list))
for found_word in re.finditer(pattern, a_string):
word = found_word.group(0)
if word in word_set:
word_set.discard(word)
yield word
if not word_set: # then we've found all words
# break out of generator, closing file
raise StopIteration
它通过字符串产生找到它们的单词,在找到所有单词后放弃搜索,或者如果它到达字符串的末尾。
用法:
word_list = ['word', 'foo', 'bar']
a_string = 'A very pleasant word to you.'
for word in words_in_string(word_list, a_string):
print word
word
感谢Peter Gibson发现这是第二快的方法。我为解决方案感到自豪。由于最好的用例是通过一个巨大的文本流,让我在这里调整上述函数来处理文件。请注意,如果单词在换行符上被破坏,则不会捕获它们,但这里的任何其他方法都不会。但
import re
def words_in_file(word_list, a_file_path):
'''
return a memory friendly iterator of words as they are found
in a file.
'''
word_set = set(word_list)
pattern = r'\b({0})\b'.format('|'.join(word_list))
with open(a_file_path, 'rU') as a_file:
for line in a_file:
for found_word in re.finditer(pattern, line):
word = found_word.group(0)
if word in word_set:
word_set.discard(word)
yield word
if not word_set: # then we've found all words
# break out of generator, closing file
raise StopIteration
为了演示,让我们写一些数据:
file_path = '/temp/temp/foo.txt'
with open(file_path, 'w') as f:
f.write('this\nis\nimportant\ndata')
和用法:
word_list = ['this', 'is', 'important']
iterator = words_in_file(word_list, file_path)
我们现在有一个迭代器,如果我们用列表来消费它:
list(iterator)
它返回:
['this', 'is', 'important']