I'm struggling with some performance issues.
The task at hand is to extract a similarity value between two strings. For this I am using fuzzywuzzy:
from fuzzywuzzy import fuzz

print fuzz.ratio("string one", "string two")
print fuzz.ratio("string one", "string two which is significantly different")

which prints:

result1: 80
result2: 38
That part is fine, though. The problem I'm facing is that I have two lists, one with 1500 rows and the other with several thousand, and I need to compare every element of the first list against every element of the second. A simple nested for loop takes an enormous amount of time to compute.
If anyone has suggestions for how I could speed this up, it would be much appreciated.
Answer 0 (score: 1)
If you need to count how often each statement occurs, then I don't know of a way to get a huge speedup over the n^2 operations required to compare the elements of the two lists. You can avoid some of the string matching by using the string lengths to rule out pairs where a match isn't possible, but you still have the nested for loops. You may well spend more time optimizing this than you save in processing time.
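As a rough sketch of that length filter (list_one, list_two, and threshold below are made-up stand-ins for your data, not anything from the question): fuzz.ratio's score can never exceed 200 * min(len(a), len(b)) / (len(a) + len(b)), since at best every character of the shorter string matches, so length alone can rule a pair out before the expensive call.

from fuzzywuzzy import fuzz

list_one = ["string one", "another string"]                    # stand-in data
list_two = ["string two", "significantly longer different text"]
threshold = 80  # minimum acceptable score, chosen arbitrarily here

matches = []
for a in list_one:
    for b in list_two:
        # Best score the two lengths allow; skip fuzz.ratio entirely
        # if even that falls below the threshold.
        best_possible = 200.0 * min(len(a), len(b)) / (len(a) + len(b))
        if best_possible < threshold:
            continue
        score = fuzz.ratio(a, b)
        if score >= threshold:
            matches.append((a, b, score))

print matches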
Answer 1 (score: 1)
I put something together for you myself (Python 2.7):
from __future__ import division
import time
from fuzzywuzzy import fuzz

one = "different simliar"
two = "similar"

def compare(first, second):
    # Compare the character set of each word in the shorter string
    # against the character sets of the words in the longer one.
    smaller, bigger = sorted([first, second], key=len)
    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]
    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += len(word)
    if counter:
        return counter / len(' '.join(s_bigger)) * 100  # percentage match
    return counter

start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)

start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)

print
print "<simliar or similar>/<length of bigger>*100%"
print 7 / len(one) * 100
print
print "Equals?"
print 7 / len(one) * 100 == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time
So I think mine is faster, but is it accurate enough for you? You decide.
Edit: now it is not only faster but also more accurate.
Results:
match: 41.1764705882
compare: --- 4.19616699219e-05 seconds ---
match: 50
fuzzy: --- 7.39097595215e-05 seconds ---
<simliar or similar>/<length of bigger>*100%
41.1764705882
Equals?
True
Faster than fuzzy?
True
Of course, if you want word-level checking the way fuzzywuzzy does it, here you go:
from __future__ import division
import time
from fuzzywuzzy import fuzz

one = "different simliar"
two = "similar"

def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)
    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]
    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += 1  # count matching words instead of characters
    if counter:
        return counter / len(s_bigger) * 100  # percentage of words matched
    return counter

start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)

start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)

print
print "Equals?"
print fuzz.ratio(one, two) == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time
Results:
match: 50.0
compare: --- 7.20024108887e-05 seconds ---
match: 50
fuzzy: --- 0.000125169754028 seconds ---
Equals?
True
Faster than fuzzy?
True
Answer 2 (score: 0)
The best solution I can think of is to use the IBM Streams framework to parallelize your essentially unavoidable O(n^2) solution.
Using that framework, you would be able to write a single-threaded kernel similar to this:

from fuzzywuzzy import fuzz

def matchStatements(tweet, statements):
    results = []
    for s in statements:
        r = fuzz.ratio(tweet, s)
        results.append(r)
    return results
and then parallelize it with a setup similar to this:

from streamsx.topology.topology import Topology  # IBM Streams Python API

def main():
    topo = Topology("tweet_compare")
    source = topo.source(getTweets)  # getTweets: the author's placeholder source
    cpuCores = 4
    match = source.parallel(cpuCores).transform(matchStatements)
    end = match.end_parallel()
    end.sink(print)
This multithreads the processing, speeding it up considerably while saving you the work of implementing the multithreading details yourself (which is the main advantage of Streams).
The idea is that each tweet is a Streams tuple to be processed across multiple processing elements.
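If you would rather not pull in IBM Streams, the same fan-out idea can be sketched with the standard library's multiprocessing module. This is an illustrative alternative, not part of the answer above; tweets and statements are stand-in data.

from multiprocessing import Pool
from fuzzywuzzy import fuzz

statements = ["string one", "something else"]  # stand-in data

def match_statements(tweet):
    # Module-level function (and module-level statements) so the
    # workers can pickle it; scores one tweet against every statement.
    return [fuzz.ratio(tweet, s) for s in statements]

if __name__ == '__main__':
    tweets = ["string one", "string two"]  # stand-in data
    pool = Pool(processes=4)  # one worker per core, as in the Streams sketch
    scores = pool.map(match_statements, tweets)
    pool.close()
    pool.join()
    print scores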
Answer 3 (score: 0)
You can convert a column to a list and assign it to a variable using column_name.tolist().
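For example, assuming a pandas DataFrame df with hypothetical columns col_a and col_b (names made up for illustration):

import pandas as pd

# Hypothetical frame standing in for your data.
df = pd.DataFrame({"col_a": ["string one", "string two"],
                   "col_b": ["string one!", "other"]})

list_a = df["col_a"].tolist()  # pandas Series -> plain Python list
list_b = df["col_b"].tolist()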
There is a Python package called two-lists-similarity for comparing the lists from two columns and computing a score.