A document vector in dictionary form looks like this:
{ 'ABCD':0.4531, 'hhks':0.08763, 'djlkl':9843 }
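The vectors can vary in length. I tried a pandas Series, but for smaller vectors I found that pandas is about 100 times slower than the dictionary implementation. Is there a better way to do this?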
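The length of d1 is always smaller than the length of d2.
def cosine_smaller_larger(d1, d2):
    # Sum the products over the keys present in both vectors (a sparse dot product).
    s = 0.0
    for key in d1.keys():
        if key in d2:
            s += d1[key] * d2[key]
    return s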
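The function above returns only the dot product, which is the numerator of the cosine similarity. Below is a minimal sketch of the full cosine for vectors stored as dicts of floats. The names `sparse_dot`, `sparse_norm`, and `cosine_similarity` are illustrative and not from the original code, and returning 0.0 for an empty or all-zero vector is an assumption.

```python
import math

def sparse_dot(d1, d2):
    """Dot product of two sparse vectors stored as dicts."""
    # Iterate over the shorter dict so the loop does the least work,
    # regardless of the order in which the arguments are passed.
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    return sum(value * d2[key] for key, value in d1.items() if key in d2)

def sparse_norm(d):
    """Euclidean length of a sparse vector stored as a dict."""
    return math.sqrt(sum(v * v for v in d.values()))

def cosine_similarity(d1, d2):
    denom = sparse_norm(d1) * sparse_norm(d2)
    if denom == 0.0:
        # Assumption: treat similarity with an empty or all-zero vector as 0.
        return 0.0
    return sparse_dot(d1, d2) / denom
```

Swapping the arguments inside `sparse_dot` removes the need for the caller to guarantee that d1 is the shorter vector.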
from pandas import Series

def seriesmult(s1, s2):
    # Align on the union of keys; a key missing from one side is treated as 0.
    return s1.mul(s2, fill_value=0).sum()

def cosine_smaller_larger(d1, d2):
    s1 = Series(d1)
    s2 = Series(d2)
    return seriesmult(s1, s2)
import time

def cosine_smaller_larger_comparison(d1, d2):
    # Time the dictionary implementation.
    t1 = time.time()
    s = 0.0
    for key in d1.keys():
        if key in d2:
            s += d1[key] * d2[key]
    t2 = time.time()
    t3 = t2 - t1  # DT: dictionary time

    # Build the Series outside the timed region, then time only the pandas multiply.
    s1 = Series(d1)
    s2 = Series(d2)
    t1 = time.time()
    ss = s1.mul(s2, fill_value=0).sum()
    t2 = time.time()
    t4 = t2 - t1  # PT: pandas time

    try:
        t5 = t4 / t3
    except ZeroDivisionError:
        t5 = 'division by zero'
    intersection = set(d1.keys()) & set(d2.keys())
    num_mult = len(intersection)
    ld1 = len(d1)
    ld2 = len(d2)
    output = "L1 = {}, L2 = {}, Mults = {}, PT<DT? = {}, PT = {}, DT = {}, PT/DT = {}".format(ld1, ld2, num_mult, t4 < t3, t4, t3, t5)
    print(output)
    return s
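A single `time.time()` difference is unreliable at the microsecond scale: several DT values in the results further down are recorded as 0.000000, which produces `inf` ratios. One hedged alternative is to repeat each call many times with the standard-library `timeit` module and keep the best run. The helper name `best_time`, the stand-in `dict_dot`, and the repeat counts are illustrative choices, not part of the original code.

```python
import timeit

def best_time(func, *args, repeat=5, number=1000):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

def dict_dot(d1, d2):
    # Stand-in for the dictionary implementation being measured.
    return sum(v * d2[k] for k, v in d1.items() if k in d2)

d1 = {'ABCD': 0.4531, 'hhks': 0.08763}
d2 = {'ABCD': 0.5, 'hhks': 2.0, 'djlkl': 1.0}
print("dict dot: {:.3e} s per call".format(best_time(dict_dot, d1, d2)))
```

To time only the pandas multiply, the same helper could be called with `seriesmult` and two pre-built Series. Passing the function that constructs the Series from dicts would include the construction cost in the measurement.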
Case 1: Large vectors (L1 > 1000)
I converted the output of cosine_smaller_larger_comparison into a pandas DataFrame to check the behavior for large vectors.
L1 = length of the first vector
L2 = length of the second vector
Mults = number of non-zero multiplications
PT = time taken by pandas
DT = time taken by the dictionary implementation
PTdivDT = PT / DT, the factor by which the dictionary implementation beats pandas
PTltDT = whether pandas was faster than the dictionary implementation for this particular pair of vectors
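The conversion step itself is not shown in the post. One way it might be done, assuming the printed lines were captured as text, is to parse each line into a record with a regular expression. The name `parse_line`, the column names (chosen to match the legend above), and the mapping of the 'division by zero' marker to `inf` are all assumptions.

```python
import re

# Mirrors the format string used in cosine_smaller_larger_comparison.
LINE_RE = re.compile(
    r"L1 = (?P<L1>\d+), L2 = (?P<L2>\d+), Mults = (?P<Mults>\d+), "
    r"PT<DT\? = (?P<PTltDT>True|False), PT = (?P<PT>[^,]+), "
    r"DT = (?P<DT>[^,]+), PT/DT = (?P<PTdivDT>.+)"
)

def parse_line(line):
    """Turn one printed benchmark line into a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    g = m.groupdict()
    try:
        ratio = float(g["PTdivDT"])
    except ValueError:
        # Assumption: the 'division by zero' marker is recorded as infinity.
        ratio = float("inf")
    return {
        "DT": float(g["DT"]),
        "L1": int(g["L1"]),
        "L2": int(g["L2"]),
        "Mults": int(g["Mults"]),
        "PT": float(g["PT"]),
        "PTdivDT": ratio,
        "PTltDT": g["PTltDT"] == "True",
    }
```

A list of such records can be passed to `pandas.DataFrame`, which accepts a list of dicts and uses the keys as column names.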
(Pdb) df1.loc[df1['L1']>1000][:10]
              DT    L1     L2  Mults        PT   PTdivDT  PTltDT
64002   0.000145  1064   1361    151  0.001333  9.195724   False
64308   0.000168  1064   1853    178  0.001125  6.692199   False
64362   0.000197  1044   1064    148  0.001260  6.397094   False
108372  0.000180  1018   1064    167  0.001298  7.210596   False
113457  0.001332  3141   9644   3141  0.003576  2.685106   False
113458  0.002342  3886   9083   3886  0.004181  1.785198   False
113583  0.002099  3435   9644   3433  0.003591  1.710813   False
113584  0.002662  4101   9083   4095  0.003828  1.437937   False
113592  0.000887  1853  19674   1850  0.005778  6.514785   False
113619  0.002480  3198   9644   3193  0.003207  1.293337   False
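Here the dictionary implementation still beats the pandas Series, but by a smaller margin.
Below are some input sizes for which pandas is more than 100 times slower.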
(Pdb) df1.loc[df1['PTdivDT']>100][:30]
          DT  L1   L2  Mults        PT     PTdivDT  PTltDT
0   0.000002   3    3      0  0.001242  651.250000   False
1   0.000002   3    3      0  0.000558  292.625000   False
6   0.000003   3    4      1  0.000341  110.000000   False
8   0.000001   0    0      0  0.000106  111.000000   False
10  0.000001   0   30      0  0.000362  379.750000   False
18  0.000001   1    3      0  0.000339  284.200000   False
19  0.000000   1    3      0  0.000341         inf   False
24  0.000001   1    3      0  0.000381  399.500000   False
26  0.000000   0    0      0  0.000103         inf   False
28  0.000003  29   30      0  0.000399  128.769231   False
31  0.000004  12   20      5  0.000409  100.941176   False
32  0.000003   8  156      4  0.000377  121.615385   False
33  0.000002  11  369      0  0.000410  214.875000   False
34  0.000002   1    1      1  0.000202  105.875000   False
35  0.000003   2   60      2  0.000349  112.615385   False
36  0.000001   1    3      0  0.000335  351.250000   False
37  0.000001   1    3      0  0.000325  272.600000   False
39  0.000003  17   32      2  0.000389  136.000000   False
41  0.000003  11   18      4  0.000386  124.538462   False
42  0.000001   3    5      0  0.000332  348.250000   False
44  0.000001   0    0      0  0.000102  107.000000   False
46  0.000004  30   42      0  0.000471  116.235294   False
51  0.000010  59  369      2  0.001014  101.261905   False
54  0.000001   1    3      0  0.000518  543.250000   False
55  0.000001   1    3      0  0.000526  551.750000   False
57  0.000004  11   32      2  0.000461  113.705882   False
60  0.000001   1    3      0  0.000660  692.250000   False
62  0.000001   0    2      0  0.000293  307.000000   False
64  0.000003  26   30      0  0.000343  110.692308   False
65  0.000002   1    1      1  0.000223  116.875000   False