How to construct a distance or dissimilarity matrix?

Time: 2019-09-09 04:06:02

Tags: python pandas distance-matrix

My df looks like this:

0       111155555511111116666611111111
1       555555111111111116666611222222
2       221111114444411111111777777777
3       111111116666666661111111111111
                   .......
1000    114444111111111111555555111111

I am computing the distance between each pair of strings. For example, computing the distance between the first two strings returns a single integer.

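Now I want to create a df that stores the distances between all pairs of strings. Since I have 1000 strings, this will be a 1000 x 1000 df. The first value is the distance between string 1 and itself, then string 1 and string 2, and so on. The next row holds string 2 and string 1, string 2 and itself, and so on.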

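1 Answer:

Answer 0: (score: 2)

Create all combinations of the Series values, compute the Hamming distance of each pair in a list comprehension, then convert the list to an array and reshape it for the DataFrame: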

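import numpy as np
import pandas as pd
import textdistance
from itertools import product

# df is assumed here to be a Series of equal-length strings
L = [textdistance.hamming(x, y) for x, y in product(df, repeat=2)]
# stored under a new name so the original strings in df stay available
df_dist = pd.DataFrame(np.array(L).reshape(len(df), len(df)))
print(df_dist)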
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
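
Since the Hamming distance is symmetric and the distance from a string to itself is 0, only the pairs above the diagonal need to be computed. The following is a minimal sketch of that idea in plain Python, not part of the snippet above: `hamming` and `pairwise_matrix` are hypothetical helper names written by hand here instead of calling textdistance, and the four sample strings are the ones shown in the question.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def pairwise_matrix(strings):
    """Build the full symmetric matrix from the upper-triangle pairs only."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]  # the diagonal stays 0
    for i, j in combinations(range(n), 2):
        d = hamming(strings[i], strings[j])
        matrix[i][j] = d
        matrix[j][i] = d  # mirror the value instead of recomputing it
    return matrix


strings = [
    "111155555511111116666611111111",
    "555555111111111116666611222222",
    "221111114444411111111777777777",
    "111111116666666661111111111111",
]
matrix = pairwise_matrix(strings)
```

The resulting nested list can be passed to `pd.DataFrame(matrix)` to get the same layout as the output above, with roughly half as many distance computations as the `product` version.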

EDIT:

To improve performance, use this solution with a modified lambda function:

import numpy as np
import pandas as pd
import textdistance
from scipy.spatial.distance import pdist, squareform

# prepare a 2-dimensional array M x N (M entries, one per string, with N = 1 dimension)
transformed_strings = np.array(df).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the hamming distance function
distance_matrix = pdist(transformed_strings, lambda x, y: textdistance.hamming(x[0], y[0]))

# get the square matrix
df1 = pd.DataFrame(squareform(distance_matrix), dtype=int)
print(df1)
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
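
If the per-pair Python call is still too slow, one further option is to avoid it entirely and compare all strings at once with NumPy broadcasting. This is only a sketch and a different technique from the pdist approach above. It assumes every string has the same length and contains only ASCII characters; the variable names `codes` and `dist` are illustrative.

```python
import numpy as np

strings = [
    "111155555511111116666611111111",
    "555555111111111116666611222222",
    "221111114444411111111777777777",
    "111111116666666661111111111111",
]

# Turn each string into a row of byte codes: shape (n_strings, string_length).
# Assumption: ASCII-only strings that all have the same length.
codes = np.array([list(s.encode("ascii")) for s in strings], dtype=np.uint8)

# Compare every row with every other row position by position, then count
# the mismatches per pair. The intermediate boolean array has shape (n, n, length).
dist = (codes[:, None, :] != codes[None, :, :]).sum(axis=2)
```

The result is an integer array that can be wrapped with `pd.DataFrame(dist)`. The trade-off is memory: the intermediate comparison holds n x n x length booleans, which is about 30 MB for 1000 strings of length 30.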