I have two lists of names (strings) that look like this:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
analysts = ['Justin Post', 'Some Dude', 'Some Chick']
I need to find where those names occur in a list of strings that looks like this:
str = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores.
The reason I need to do this is so that I can join the conversation strings together, delimited by the names. How would I go about doing this efficiently?
I've looked at some similar questions and tried their solutions with no luck, such as this:
if any(x in str for x in executives):
    print('yes')
And this...
match = next((x for x in executives if x in str), False)
match
Answer 0 (score: 1)
I'm not sure if this is what you're looking for:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]
result = [s for s in text if any(ex in s for ex in executives)]
print(result)
Output: ['Brian Olsavsky - Amazon.com']
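If the end goal is to stitch the conversation back together keyed by speaker, a rough sketch building on the same check could look like this (the extra speaker names and the rule that any line containing a known name opens a new turn are assumptions on my part, not part of the question):

speakers = executives + ['Justin Post', 'Dave Fildes']
conversation = []
for s in text:
    if any(name in s for name in speakers):
        conversation.append([s])       # a name line opens a new turn
    elif conversation:
        conversation[-1].append(s)     # attach the utterance to the current turn
joined = [' '.join(turn) for turn in conversation]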
Answer 1 (score: 1)
"\system\etc\security\cacerts"
If you also need the exact positions, you can use the following:
str = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]
executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']
For each match, this prints the name, the index of the sentence containing it, and the name's character offset within that sentence:
print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])
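For the lists above, that prints:

[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]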
Answer 2 (score: 1)
This answer focuses on efficiency. If that is not a key concern, use the other answers. If it is, build a dict from the corpus you want to search, then use that dict to look up what you are searching for.
#import stuff we need later
import string
import random
import numpy as np
import time
import matplotlib.pyplot as plt
First, let's create the list of strings we'll search. We create random words, by which I mean random sequences of characters, with lengths drawn from a Poisson distribution, using the following function:
def poissonlength_words(lam_word): #generating words, length chosen from a Poisson distrib
return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])
(lam_word is the parameter of the Poisson distribution.)
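A quick call shows the kind of output to expect (the result is random, so this is only an illustration):

poissonlength_words(5)   # e.g. 'qxts'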
From these words, let's create number_of_sentences variable-length sentences (by sentence I mean a randomly generated list of words, separated by spaces). Sentence lengths are also drawn from a Poisson distribution.
lam_word=5
lam_sentence=1000
number_of_sentences = 10000
sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
             for x in range(number_of_sentences)]
The first sentence, sentences[0], will now start something like this:
tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc ...
Let's also create the names we will search for, and let these names be bigrams. The first name (the first element of the bigram) will be n characters long, the last name (the second element of the bigram) will be m characters long, and both will consist of random characters:
def bigramgen(n,m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)])+' '+\
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])
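For example, bigramgen(2,1) produces strings shaped like ab c, with random letters:

bigramgen(2,1)   # e.g. 'qk f'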
Suppose we want to find the sentences in which a given bigram (say, ab c) occurs. We don't want to find dab c or ab cd, only the places where ab c stands on its own.
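A simple way to enforce the standalone match is to pad with spaces. The sketch below (mine, not part of the answer) pads the sentence as well, which also catches a bigram at the very beginning or end; the loops further down pad only the bigram:

def contains_standalone(bigram, sentence):
    # surrounding spaces ensure 'ab c' does not match inside 'dab c' or 'ab cd'
    return ' ' + bigram + ' ' in ' ' + sentence + ' '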
To test how fast each method is, let's search for a growing number of bigrams and measure the elapsed time. The number of bigrams we search for can be, for example:
number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]
The brute-force method: simply iterate over every bigram, iterate over every sentence, and use in to find matches, measuring the elapsed time with time.time().
bruteforcetime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start = time.time()
    for bigram in bigrams:
        #the core of the brute force method starts here
        reslist=[]
        for sentencei, sentence in enumerate(sentences):
            if ' '+bigram+' ' in sentence:
                reslist.append([bigram,sentencei])
        #and ends here
    end = time.time()
    bruteforcetime.append(end-start)
bruteforcetime will hold the number of seconds needed to find 10, 30, 50, ... bigrams.
Warning: this can take a long time for a large number of bigrams.
The dict method: let's create an empty set for every word that occurs in any sentence (using a dict comprehension):
worddict={word:set() for sentence in sentences for word in sentence.split(' ')}
Then, for every word, add to its set the index of every sentence it occurs in:
for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)
Note that we only do this once, no matter how many bigrams we search for later.
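On a tiny toy corpus, the resulting inverted index looks like this (illustrative example of mine):

toy = ['ab c x', 'c y ab']
toydict = {word: set() for sentence in toy for word in sentence.split(' ')}
for i, sentence in enumerate(toy):
    for word in sentence.split(' '):
        toydict[word].add(i)
# toydict == {'ab': {0, 1}, 'c': {0, 1}, 'x': {0}, 'y': {1}}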
Using this dictionary, we can look up the sentences in which each part of the bigram occurs. This is very fast, because looking up a dict value is very fast. We then take the intersection of these sets: when we search for ab c, we get the set of sentence indices in which both ab and c occur. The final in check is still needed, because both words occurring in a sentence does not guarantee that they occur next to each other and in the right order:
for bigram in bigrams:
    reslist=[]
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        if bigram in sentences[candidate]:
            reslist.append([bigram, candidate])
Let's put the whole thing together and measure the elapsed time:
logtime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start_time=time.time()
    worddict={word:set() for sentence in sentences for word in sentence.split(' ')}
    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)
    for bigram in bigrams:
        reslist=[]
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if bigram in sentences[candidate]:
                reslist.append([bigram, candidate])
    end_time=time.time()
    logtime.append(end_time-start_time)
Warning: for a large number of bigrams this can also take a long time, but less than the brute-force method.
We can plot the time each method takes:
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Or, plotting the y axis on a log scale:
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
This gives us the plot below. [plot image: elapsed time (sec) for both methods vs. number of bigrams searched]
Building the worddict dictionary costs a lot of time, which is a disadvantage when searching for a small number of names. There is a point, however, once the corpus is large enough and the number of names we search for is high enough, at which this cost is compensated by the method's search speed compared to brute force. If those conditions are met, I recommend using this method.
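Applied back to the original question, a rough sketch could look like this (the shortened lists and the space-split tokenization, which assumes names are not broken up by punctuation, are my assumptions):

text = ['Justin Post - Bank of America',
        'Brian Olsavsky - Amazon.com',
        "Thank you, Justin."]
executives = ['Brian Olsavsky', 'Justin Post']

worddict = {word: set() for sentence in text for word in sentence.split(' ')}
for i, sentence in enumerate(text):
    for word in sentence.split(' '):
        worddict[word].add(i)

for name in executives:
    # sentences containing every word of the name, then an exact containment check
    candidates = set.intersection(*[worddict.get(part, set()) for part in name.split(' ')])
    print(name, [i for i in candidates if name in text[i]])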
(Notebook here.)