I am trying to search through a list of strings (sentences) and check whether they contain a particular set of substrings. To do this, I am using Python's `any` function:
import time

sentences = ["I am in London tonight",
"I am in San Fran tomorrow",
"I am in Paris next Wednesday"]
# Imagine the following lists to contain 1000's of strings
listOfPlaces = ["london", "paris", "san fran"]
listOfTimePhrases = ["tonight", "tomorrow", "week", "monday", "wednesday", "month"]
start = time.time()
sntceIdxofPlaces = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfTimePhrases)]
end = time.time()
print(end-start)
If you imagine my lists being extremely large, I find that my two `any` statements take quite a long time: roughly 2 seconds for two such queries. Do you know why it is taking so long, and do you know of any way to make the code faster?
Thanks
Answer 0 (score: 4)
Don't enumerate `sentences` twice. You can do both checks in a single loop over the sentences:
sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sent in enumerate(sentences):
if any(x in sent for x in listOfPlaces):
sntceIdxofPlaces.append(pos)
if any(x in sent for x in listOfTimePhrases):
sntceIdxofTimes.append(pos)
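A self-contained run of the single-pass version on the question's sample data. Note the `.lower()` preprocessing is an addition here; without it, the lowercase place names never match the capitalized words in the sentences:

```python
# Sample data from the question, lowercased so substring checks are case-insensitive
# (an assumption added here -- the original snippet would miss "London" vs "london").
sentences = [s.lower() for s in [
    "I am in London tonight",
    "I am in San Fran tomorrow",
    "I am in Paris next Wednesday",
]]
listOfPlaces = ["london", "paris", "san fran"]
listOfTimePhrases = ["tonight", "tomorrow", "week", "monday", "wednesday", "month"]

# One enumerate over sentences, two membership checks per sentence.
sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sent in enumerate(sentences):
    if any(x in sent for x in listOfPlaces):
        sntceIdxofPlaces.append(pos)
    if any(x in sent for x in listOfTimePhrases):
        sntceIdxofTimes.append(pos)

print(sntceIdxofPlaces)  # [0, 1, 2]
print(sntceIdxofTimes)   # [0, 1, 2]
```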
Answer 1 (score: 2)
There is one inefficiency: you are not using sets. Checking membership in a set is a very efficient operation (O(1), vs. O(n) for a list).
import time

sentences = ["I am in London tonight",
"I am in San Fran tomorrow",
"I am in Paris next Wednesday"]
# Imagine the following lists to contain 1000's of strings
listOfPlaces = {"london", "paris", "san fran"}
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}
start = time.time()
sntceIdxofPlaces = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfTimePhrases)]
end = time.time()
print(end-start)
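Note that `any(x in sent for x in listOfPlaces)` still iterates over every element of the set, so the O(1) lookup is not actually exercised by the snippet above. To benefit from it, one can intersect each sentence's word set with the phrase set. A sketch; it assumes single-word phrases, so a multi-word entry like "san fran" would need extra n-gram handling:

```python
sentences = ["I am in London tonight",
             "I am in San Fran tomorrow",
             "I am in Paris next Wednesday"]
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}

# Intersect the sentence's word set with the phrase set: each probe is O(1),
# so per-sentence cost no longer grows with the size of the phrase set.
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences)
                   if set(sent.lower().split()) & listOfTimePhrases]
print(sntceIdxofTimes)  # [0, 1, 2]
```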
Answer 2 (score: 1)
Here are three alternative approaches, all of which check each sentence against the list of target phrases. All of them scale badly with the length of the target-phrase lists.
sentences = [
"I am the walrus",
"I am in London",
"I am in London tonight",
"I am in San Fran tomorrow",
"I am in Paris next Wednesday"
]
sentences *= 1000 # really we want to examine large `listOfPlaces` and `listOfTimePhrases`, but those must contain unique entries and are harder to generate; this is a quicker, dirtier way to test timing scalability
# Imagine the following lists to contain 1000's of strings
listOfPlaces = {"london", "paris", "san fran"}
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}
# preprocess to guard against substring false positives and case-mismatch false negatives:
sentences = [ ' ' + x.lower() + ' ' for x in sentences ]
listOfPlaces = { ' ' + x.lower() + ' ' for x in listOfPlaces }
listOfTimePhrases = { ' ' + x.lower() + ' ' for x in listOfTimePhrases }
#listOfPlaces = list( listOfPlaces )
#listOfTimePhrases = list( listOfTimePhrases )
def foo():
sntceIdxofPlaces = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfTimePhrases)]
return sntceIdxofPlaces, sntceIdxofTimes
def foo2():
sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sentence in enumerate(sentences):
if any(x in sentence for x in listOfPlaces): sntceIdxofPlaces.append(pos)
if any(x in sentence for x in listOfTimePhrases): sntceIdxofTimes.append(pos)
return sntceIdxofPlaces, sntceIdxofTimes
def foo3():
sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sentence in enumerate(sentences):
for x in listOfPlaces:
if x in sentence: sntceIdxofPlaces.append(pos); break
for x in listOfTimePhrases:
if x in sentence: sntceIdxofTimes.append(pos); break
return sntceIdxofPlaces, sntceIdxofTimes
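As a sanity check that the variants agree, here is a small self-contained version of the setup above (sample data only, without the x1000 blow-up), asserting that the comprehension-based and explicit-loop styles select the same indices:

```python
# Same preprocessing as above: pad and lowercase to avoid substring and case issues.
sentences = [' ' + s.lower() + ' ' for s in [
    "I am the walrus", "I am in London", "I am in London tonight",
    "I am in San Fran tomorrow", "I am in Paris next Wednesday"]]
listOfPlaces = {' ' + x + ' ' for x in ("london", "paris", "san fran")}
listOfTimePhrases = {' ' + x + ' ' for x in ("tonight", "tomorrow", "wednesday")}

def by_comprehension(phrases):
    return [pos for pos, s in enumerate(sentences) if any(x in s for x in phrases)]

def by_loop(phrases):
    out = []
    for pos, s in enumerate(sentences):
        for x in phrases:
            if x in s:
                out.append(pos)
                break
    return out

for phrases in (listOfPlaces, listOfTimePhrases):
    assert by_comprehension(phrases) == by_loop(phrases)

print(by_comprehension(listOfPlaces))  # [1, 2, 3, 4]
```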
Here are the timing results:
In [171]: timeit foo()
100 loops, best of 3: 15.6 ms per loop
In [172]: timeit foo2()
100 loops, best of 3: 16 ms per loop
In [173]: timeit foo3()
100 loops, best of 3: 8.07 ms per loop
It seems that `any()` may be inefficient, much to my surprise. It may be running its input generator all the way to the end even when a match is found early and the answer is already known. I understand it is not supposed to work like that, but I cannot otherwise account for the factor-of-2 difference in running time between `foo2()` and `foo3()`, which appear to give identical output.
Also: since `listOfPlaces` and `listOfTimePhrases` are being iterated over, rather than tested for membership, their timings appear to be unchanged whether they are `set`s or `list`s.
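For what it is worth, `any()` is documented to short-circuit on the first truthy value, which a tiny probe confirms; the factor-of-2 difference is more plausibly the per-call cost of creating a generator object and frame:

```python
def probe():
    yield True
    raise RuntimeError("any() consumed the generator past the first match")

# If any() ran the generator to the end, the RuntimeError would propagate.
assert any(probe())
print("any() short-circuits")
```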
Answer 3 (score: 0)
Use sets instead of lists for `listOfPlaces` and `listOfTimePhrases`. Looking up membership in a set is much faster:
listOfPlaces = set(["london", "paris", "san fran"])
listOfTimePhrases = set(["tonight", "tomorrow", "week", "monday", "wednesday", "month"])
Answer 4 (score: 0)
Here is a solution that exploits fast lookup in `listOfPlaces` and `listOfTimePhrases` as `set`s while also accounting for possible multi-word phrases. It runs quickly even if `listOfPlaces` and `listOfTimePhrases` contain thousands of elements.
sentences = [
"I am the walrus",
"I am in London",
"I am in London tonight",
"I am in San Fran tomorrow",
"I am in Paris next Wednesday"
]
sentences *= 1000
# Imagine the following lists to contain 1000's of strings
placePhrases = {"london", "paris", "san fran"}
timePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}
# preprocess to guard against case-mismatch false negatives:
sentences = [ x.lower() for x in sentences ]
placePhrases = { x.lower() for x in placePhrases }
timePhrases = { x.lower() for x in timePhrases }
# create additional sets in which incomplete phrases can be looked up:
placePrefixes = set()
for phrase in placePhrases:
words = phrase.split()
if len(words) > 1:
for i in range(1, len(words)):
placePrefixes.add(' '.join(words[:i]))
timePrefixes = set()
for phrase in timePhrases:
words = phrase.split()
if len(words) > 1:
for i in range(1, len(words)):
timePrefixes.add(' '.join(words[:i]))
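The two prefix-building loops above are identical apart from their input, so they could be factored into a small helper (a hypothetical refactor, same logic):

```python
def word_prefixes(phrases):
    """All proper multi-word prefixes of each phrase, e.g. {'san fran'} -> {'san'}."""
    out = set()
    for phrase in phrases:
        words = phrase.split()
        for i in range(1, len(words)):
            out.add(' '.join(words[:i]))
    return out

assert word_prefixes({"san fran", "paris"}) == {"san"}
assert word_prefixes({"new york city"}) == {"new", "new york"}
```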
def scan(sentences):
sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sentence in enumerate(sentences):
hasPlace = hasTime = False
placePrefix = timePrefix = ''
for word in sentence.split():
if not hasPlace:
placePhrase = placePrefix + (' ' if placePrefix else '') + word
if placePhrase in placePrefixes:
placePrefix = placePhrase
else:
placePrefix = ''
if placePhrase in placePhrases: hasPlace = True
if not hasTime:
timePhrase = timePrefix + (' ' if timePrefix else '') + word
if timePhrase in timePrefixes:
timePrefix = timePhrase
else:
timePrefix = ''
if timePhrase in timePhrases: hasTime = True
if hasTime and hasPlace: break
if hasPlace: sntceIdxofPlaces.append(pos)
if hasTime: sntceIdxofTimes.append(pos)
return sntceIdxofPlaces, sntceIdxofTimes
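The core of `scan()` is the rolling prefix match; isolated as a small predicate (a hypothetical helper, same logic as the loop above), it reads:

```python
def contains_phrase(sentence, phrases, prefixes):
    # Extend the current candidate word by word; reset whenever the
    # candidate is no longer a prefix of any multi-word phrase.
    prefix = ''
    for word in sentence.lower().split():
        candidate = (prefix + ' ' + word) if prefix else word
        if candidate in phrases:
            return True
        prefix = candidate if candidate in prefixes else ''
    return False

phrases = {"san fran", "paris"}
prefixes = {"san"}
assert contains_phrase("I am in San Fran tomorrow", phrases, prefixes)
assert contains_phrase("I am in Paris next Wednesday", phrases, prefixes)
assert not contains_phrase("I am the walrus", phrases, prefixes)
```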
Answer 5 (score: 0)
I managed to find a much faster way of doing this using scikit-learn's CountVectorizer function:
import numpy as np
from sklearn import feature_extraction

sentences = ["I am in London tonight",
"I am in San Fran tomorrow",
"I am in Paris next Wednesday"]
listOfPlaces = ["london", "paris", "san fran"]
cv = feature_extraction.text.CountVectorizer(vocabulary=listOfPlaces)
# We now get in the next step a vector of size len(sentences) x len(listOfPlaces)
taggedSentences = cv.fit_transform(sentences).toarray()
aggregateTags = np.sum(taggedSentences, axis=1)
We end up with a vector of size len(sentences) by 1, where each row contains a count of how many words from the word list appear in the corresponding sentence.
I found the results to be very fast on large datasets (on the order of 0.02 s).
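One caveat worth noting: with CountVectorizer's default `ngram_range=(1, 1)`, a two-word vocabulary entry such as "san fran" can never match, so its column is always zero. Passing `ngram_range=(1, 2)` fixes that; a sketch that also recovers the indices of matching sentences:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I am in London tonight",
             "I am in San Fran tomorrow",
             "I am in Paris next Wednesday"]
listOfPlaces = ["london", "paris", "san fran"]

# ngram_range=(1, 2) lets the two-word entry "san fran" match; the default
# (1, 1) tokenizer only ever emits single words.
cv = CountVectorizer(vocabulary=listOfPlaces, ngram_range=(1, 2))
counts = cv.fit_transform(sentences).toarray()  # len(sentences) x len(listOfPlaces)
placeIdx = np.nonzero(counts.sum(axis=1) > 0)[0].tolist()
print(placeIdx)  # [0, 1, 2]
```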