
What is the fastest way in Python to compute the inner (dot) product of normalized document vectors?

Asked: 2015-05-13 13:02:03

Tags: python pandas

A document vector in dictionary form looks like this:

{'ABCD': 0.4531, 'hhks': 0.08763, 'djlkl': 9843}

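The length of the vectors can vary. I tried pandas Series, but I have found that for smaller vectors pandas is about 100 times slower than the dictionary implementation. Is there a better way to do this?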

Code using dictionaries

The length of d1 is always less than the length of d2.

def cosine_smaller_larger(d1, d2):
    s = 0.0
    for key in d1.keys():
        if key in d2:
            s += d1[key] * d2[key]
    return s
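A slightly more compact variant of the loop above, offered only as a sketch. It assumes Python 3, and the name `dot_sparse` is hypothetical (it does not appear in the original post). It iterates over whichever dictionary is shorter, so the caller does not have to order the arguments. Whether it is actually faster than the explicit loop would need to be measured.

```python
def dot_sparse(d1, d2):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the shorter dict so the number of lookups is bounded by it.
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    # Keys missing from d2 contribute nothing and are skipped.
    return sum(w * d2[k] for k, w in d1.items() if k in d2)


a = {'ABCD': 0.5, 'hhks': 0.25}
b = {'ABCD': 0.5, 'zzz': 1.0, 'hhks': 2.0}
print(dot_sparse(a, b))  # 0.75
```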

Code using pandas

from pandas import Series

def seriesmult(s1, s2):
    return s1.mul(s2, fill_value=0).sum()

def cosine_smaller_larger(d1, d2):
    s1 = Series(d1)
    s2 = Series(d2)
    return seriesmult(s1, s2)

Code used to profile the two approaches above (I ignore the time spent creating the pandas Series)

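import time
from pandas import Series

def cosine_smaller_larger_comparison(d1, d2):
    # Time the dictionary implementation (DT).
    t1 = time.time()
    s = 0.0
    for key in d1.keys():
        if key in d2:
            s += d1[key] * d2[key]
    t2 = time.time()
    t3 = t2 - t1

    # Build the Series outside the timed region, then time pandas (PT).
    s1 = Series(d1)
    s2 = Series(d2)
    t1 = time.time()
    ss = s1.mul(s2, fill_value=0).sum()
    t2 = time.time()
    t4 = t2 - t1
    try:
        t5 = t4 / t3
    except ZeroDivisionError:
        # The dictionary timing was below the clock resolution.
        t5 = 'division by zero'
    # Number of keys shared by both vectors, i.e. non-zero multiplications.
    intersection = set(d1.keys()) & set(d2.keys())
    num_mult = len(intersection)
    ld1 = len(d1)
    ld2 = len(d2)
    output = "L1 = {}, L2 = {}, Mults = {}, PT<DT? = {}, PT = {}, DT = {}, PT/DT = {}".format(ld1, ld2, num_mult, t4 < t3, t4, t3, t5)
    print(output)

    return s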
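A single pair of `time.time()` calls can be too coarse for operations that take about a microsecond, which is the range of the dictionary timings reported in this question. The sketch below shows one way such a measurement could be made with the standard-library `timeit` module instead. It assumes Python 3; the names `dict_dot` and `seconds_per_call`, and the sample vectors, are hypothetical and not part of the original post.

```python
import timeit


def dict_dot(d1, d2):
    """Same dictionary loop as in the question."""
    s = 0.0
    for key in d1:
        if key in d2:
            s += d1[key] * d2[key]
    return s


def seconds_per_call(func, *args, number=10000, repeat=5):
    """Best-of-`repeat` average time of one call to func(*args), in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    # Each repeat runs `number` calls; taking the minimum reduces noise.
    return min(timer.repeat(repeat=repeat, number=number)) / number


d1 = {'ABCD': 0.4531, 'hhks': 0.08763}
d2 = {'ABCD': 0.1, 'hhks': 0.2, 'djlkl': 0.3}
print(seconds_per_call(dict_dot, d1, d2))
```

Averaging over many calls avoids zero-length intervals, so a ratio such as PT / DT stays finite. The lambda adds a small constant overhead to every call, which matters when comparing very short operations.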

Case 1: large vectors (L1 > 1000)

I converted the output of cosine_smaller_larger_comparison into a pandas DataFrame to examine the behavior for large vectors.

L1 = length of the first vector
L2 = length of the second vector
Mults = number of non-zero multiplications
PT = time taken by pandas (seconds)
DT = time taken by the dictionary implementation (seconds)
PTdivDT = PT / DT, the factor by which the dictionary implementation beats pandas
PTltDT = whether pandas was faster than the dictionary implementation for this pair of vectors



(Pdb) df1.loc[df1['L1']>1000][:10]
              DT    L1     L2  Mults        PT   PTdivDT PTltDT
64002   0.000145  1064   1361    151  0.001333  9.195724  False
64308   0.000168  1064   1853    178  0.001125  6.692199  False
64362   0.000197  1044   1064    148  0.001260  6.397094  False
108372  0.000180  1018   1064    167  0.001298  7.210596  False
113457  0.001332  3141   9644   3141  0.003576  2.685106  False
113458  0.002342  3886   9083   3886  0.004181  1.785198  False
113583  0.002099  3435   9644   3433  0.003591  1.710813  False
113584  0.002662  4101   9083   4095  0.003828  1.437937  False
113592  0.000887  1853  19674   1850  0.005778  6.514785  False
113619  0.002480  3198   9644   3193  0.003207  1.293337  False

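Here the dictionary implementation still beats pandas Series, but by a smaller margin.

Case 2: smaller vectors

Below are some input sizes for which pandas is more than 100 times slower.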

(Pdb) df1.loc[df1['PTdivDT']>100][:30]
          DT  L1   L2  Mults        PT     PTdivDT PTltDT
0   0.000002   3    3      0  0.001242  651.250000  False
1   0.000002   3    3      0  0.000558  292.625000  False
6   0.000003   3    4      1  0.000341  110.000000  False
8   0.000001   0    0      0  0.000106  111.000000  False
10  0.000001   0   30      0  0.000362  379.750000  False
18  0.000001   1    3      0  0.000339  284.200000  False
19  0.000000   1    3      0  0.000341         inf  False
24  0.000001   1    3      0  0.000381  399.500000  False
26  0.000000   0    0      0  0.000103         inf  False
28  0.000003  29   30      0  0.000399  128.769231  False
31  0.000004  12   20      5  0.000409  100.941176  False
32  0.000003   8  156      4  0.000377  121.615385  False
33  0.000002  11  369      0  0.000410  214.875000  False
34  0.000002   1    1      1  0.000202  105.875000  False
35  0.000003   2   60      2  0.000349  112.615385  False
36  0.000001   1    3      0  0.000335  351.250000  False
37  0.000001   1    3      0  0.000325  272.600000  False
39  0.000003  17   32      2  0.000389  136.000000  False
41  0.000003  11   18      4  0.000386  124.538462  False
42  0.000001   3    5      0  0.000332  348.250000  False
44  0.000001   0    0      0  0.000102  107.000000  False
46  0.000004  30   42      0  0.000471  116.235294  False
51  0.000010  59  369      2  0.001014  101.261905  False
54  0.000001   1    3      0  0.000518  543.250000  False
55  0.000001   1    3      0  0.000526  551.750000  False
57  0.000004  11   32      2  0.000461  113.705882  False
60  0.000001   1    3      0  0.000660  692.250000  False
62  0.000001   0    2      0  0.000293  307.000000  False
64  0.000003  26   30      0  0.000343  110.692308  False
65  0.000002   1    1      1  0.000223  116.875000  False
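The `inf` entries in the PTdivDT column presumably come from rows where DT was recorded as exactly zero, that is, where the dictionary loop finished within the resolution of the clock. A minimal sketch of a ratio helper that makes this case explicit follows; the name `slowdown` is hypothetical and is not how the original table was necessarily produced.

```python
def slowdown(pt, dt):
    """Return PT / DT, or infinity when DT was measured as exactly zero."""
    if dt == 0:
        # The dictionary timing was below the clock resolution.
        return float('inf')
    return pt / dt
```

Rows with an infinite ratio say more about the timer than about pandas, so they are probably best excluded or re-measured over many repetitions before drawing conclusions.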

0 answers:

No answers yet