Python

Time: 2016-08-21 16:59:06

Tags: python performance fuzzywuzzy

I am struggling with a performance problem. The task at hand is to extract a similarity value between two strings. For this I am using fuzzywuzzy:

from fuzzywuzzy import fuzz

print fuzz.ratio("string one", "string two")  # 80
print fuzz.ratio("string one", "string two which is significantly different")  # 38

That part works fine on its own. The problem I face is that I have two lists, one with 1500 rows and the other with several thousand, and I need to compare every element of the first against every element of the second. Doing this in a simple for loop takes an enormous amount of time to compute.
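A minimal sketch of that naive loop (the list contents are made up, and the standard library's difflib stands in for fuzz.ratio so it runs without fuzzywuzzy):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # stand-in for fuzz.ratio: same 0-100 scale
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

list_one = ["string one", "string two"]               # ~1500 rows in practice
list_two = ["string one", "significantly different"]  # several thousand rows

# every element of the first list against every element of the second: O(n*m)
scores = [(a, b, ratio(a, b)) for a in list_one for b in list_two]
```

With 1500 x several thousand elements this is millions of ratio calls, which is where the time goes.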

Any suggestions on how I can speed this up would be greatly appreciated.

4 Answers:

Answer 0 (score: 1)

If you need a score for every statement, then I don't know of a way to get a huge speedup over the n² operations required to compare the elements of one list against the other. You can avoid some of the string comparisons by using the lengths to rule out pairs where a close match is impossible, but you would still have the nested for loop. You would probably spend more time optimizing it than the optimization would save in processing time.
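A minimal sketch of that length filter (the helper names and the 70% threshold are mine, and the bound assumes a fuzz.ratio-style score of 2·M/(len(a)+len(b)), where M, the number of matching characters, is at most the length of the shorter string):

```python
def max_possible_ratio(a, b):
    # upper bound on the ratio, computed from lengths alone: at best,
    # every character of the shorter string lines up with the longer one
    shorter, longer = sorted((len(a), len(b)))
    return 200.0 * shorter / (shorter + longer)

def candidate_pairs(list_one, list_two, threshold=70):
    # only yield pairs whose lengths leave the threshold reachable,
    # skipping the expensive string comparison for everything else
    for a in list_one:
        for b in list_two:
            if max_possible_ratio(a, b) >= threshold:
                yield a, b
```

The nested loop is still there; the filter only cheapens the iterations where the lengths differ too much to matter.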

Answer 1 (score: 1)

I created something for you myself (Python 2.7):

from __future__ import division

import time

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)

    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        # a word matches if some word in the bigger string
        # consists of exactly the same set of characters
        if set(word) in bigger_sets:
            counter += len(word)
    if counter:
        return counter / len(' '.join(s_bigger)) * 100  # percentage match
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "<simliar or similar>/<length of bigger>*100%"
print 7/len(one)*100
print
print "Equals?"
print 7/len(one)*100 == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time

So I think mine is faster, but is it accurate enough for you? You decide.

Edit: Now it is not only faster, but also more accurate.

Results:

match:  41.1764705882
compare: --- 4.19616699219e-05 seconds ---
match:  50
fuzzy: --- 7.39097595215e-05 seconds ---

<simliar or similar>/<length of bigger>*100%
41.1764705882

Equals?
True

Faster than fuzzy?
True

Of course, if you want whole-word matching like fuzzywuzzy does, here you go:

from __future__ import division
import time

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)

    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        # count whole-word matches by character set
        if set(word) in bigger_sets:
            counter += 1
    if counter:
        return counter / len(s_bigger) * 100  # percentage of words matched
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "Equals?"
print fuzz.ratio(one, two) == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time

Results:

match:  50.0
compare: --- 7.20024108887e-05 seconds ---
match:  50
fuzzy: --- 0.000125169754028 seconds ---

Equals?
True

Faster than fuzzy?
True

Answer 2 (score: 0)

The best solution I can think of is to use the IBM Streams framework to parallelize your essentially unavoidable O(n²) solution.

Using that framework, you would be able to write a single-threaded kernel similar to this:
def matchStatements(tweet, statements):
    results = []
    for s in statements:
        r = fuzz.ratio(tweet, s)
        results.append(r)
    return results

and then parallelize it using a setup similar to this:
def main():
    topo = Topology("tweet_compare")
    source = topo.source(getTweets)
    cpuCores = 4
    match = source.parallel(cpuCores).transform(matchStatements)
    end = match.end_parallel()
    end.sink(print)

This multithreads the processing, speeding it up significantly, while saving you the work of implementing the multithreading details yourself (which is the main advantage of Streams).

The idea is that each tweet is a Streams tuple to be processed across multiple processing elements.
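If pulling in Streams is too heavy, the same fan-out idea can be sketched with the standard library's multiprocessing (the data here is made up, and difflib's SequenceMatcher stands in for fuzz.ratio so the sketch is self-contained):

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

STATEMENTS = ["string one", "string two"]  # the large statements list

def match_statements(tweet):
    # score one tweet against every statement, scaled 0-100 like fuzz.ratio
    return [int(round(SequenceMatcher(None, tweet, s).ratio() * 100))
            for s in STATEMENTS]

def main():
    tweets = ["string one", "totally unrelated text"]
    pool = Pool(processes=4)  # like cpuCores = 4 in the Streams setup
    try:
        # each tweet goes to whichever worker process is free
        results = pool.map(match_statements, tweets)
    finally:
        pool.close()
        pool.join()
    return results
```

Calling main() fans the tweets out over four worker processes; like the Streams version, this only divides the O(n²) work, it does not reduce it.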

The documentation for the Streams Python topology framework is here, and the parallel operator in particular is described here.

Answer 3 (score: 0)

You can convert a column to a list with column_name.tolist() and assign it to a variable.

There is a Python package called two-lists-similarity that compares the lists of two columns and computes a score.

https://pypi.org/project/two-lists-similarity/