My df looks like this:
0       111155555511111116666611111111
1       555555111111111116666611222222
2       221111114444411111111777777777
3       111111116666666661111111111111
.......
1000    114444111111111111555555111111
I am calculating the distance between each pair of strings. For example, getting the distance between the first two strings returns an integer.
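Now I want to create a df that stores the distances between every pair of strings. Since I have 1000 strings, this would be a 1000 x 1000 df. The first value in the first row is the distance between string 1 and itself, then string 1 and string 2, and so on. The next row holds string 2 and string 1, string 2 and itself, and so on.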
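For reference, the distance used in the answer below is the Hamming distance, which counts the positions where two strings differ. A minimal plain-Python sketch follows, assuming all strings have the same length as in the sample data. The helper name `hamming` is hypothetical and is not the `textdistance` implementation.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption: every string has the same length, as in the sample data.
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# First two strings from the question.
first = "111155555511111116666611111111"
second = "555555111111111116666611222222"

distance = hamming(first, second)
print(distance)  # 14, matching entry [0][1] of the answer's output
```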
Answer 0 (score: 2)
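Create all combinations of the Series values, compute the hamming distance for each pair in a list comprehension, then convert the list to an array and reshape it for the DataFrame constructor: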
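import numpy as np
import pandas as pd
import textdistance
from itertools import product
# df is assumed to be a Series of strings; product yields every ordered pair
L = [textdistance.hamming(x, y) for x, y in product(df, repeat=2)]
# use a new name so the original strings in df stay available
df_dist = pd.DataFrame(np.array(L).reshape(len(df), len(df)))
print(df_dist)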
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
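The reshape works because `product(values, repeat=2)` yields the pairs in row-major order: every pair starting with the first value, then every pair starting with the second, and so on. A small sketch of that ordering using only the standard library follows. The `values` list and the `hamming` helper are hypothetical stand-ins, not the question's data or the `textdistance` function.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    # Stand-in distance: number of differing positions (equal lengths assumed).
    return sum(x != y for x, y in zip(a, b))


values = ["abc", "abd", "xbd"]  # hypothetical sample strings
n = len(values)

# Flat list of n * n distances, ordered (0,0), (0,1), (0,2), (1,0), ...
flat = [hamming(a, b) for a, b in product(values, repeat=2)]

# Pair (i, j) sits at flat[i * n + j], so chunks of n rebuild the rows.
rows = [flat[i * n:(i + 1) * n] for i in range(n)]
for row in rows:
    print(row)
```

Slicing into chunks of `n` here plays the same role as `reshape(len(df), len(df))` in the answer.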
EDIT:
To improve performance, use this solution with a modified lambda function:
import numpy as np
import pandas as pd
import textdistance
from scipy.spatial.distance import pdist, squareform
# prepare 2 dimensional array M x N (M entries with N = 1 dimension each)
# df is the original Series of strings
transformed_strings = np.array(df).reshape(-1,1)
# calculate condensed distance matrix by wrapping the hamming distance function
distance_matrix = pdist(transformed_strings,lambda x,y: textdistance.hamming(x[0],y[0]))
# get square matrix
df1 = pd.DataFrame(squareform(distance_matrix), dtype=int)
print (df1)
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
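The saving comes from symmetry: `pdist` evaluates the metric once per unordered pair, and `squareform` mirrors that condensed result into a square matrix with a zero diagonal. A rough plain-Python sketch of the same idea follows. The `pairwise_matrix` and `hamming` helpers are hypothetical and only illustrate the technique; they are not how SciPy implements it.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    # Stand-in distance: number of differing positions (equal lengths assumed).
    return sum(x != y for x, y in zip(a, b))


def pairwise_matrix(values, dist):
    """Fill a symmetric matrix, calling dist once per unordered pair."""
    n = len(values)
    matrix = [[0] * n for _ in range(n)]  # the diagonal stays 0
    calls = 0
    for i, j in combinations(range(n), 2):
        d = dist(values[i], values[j])
        calls += 1
        matrix[i][j] = d
        matrix[j][i] = d  # mirror into the lower triangle
    return matrix, calls


# The five strings shown in the question.
strings = [
    "111155555511111116666611111111",
    "555555111111111116666611222222",
    "221111114444411111111777777777",
    "111111116666666661111111111111",
    "114444111111111111555555111111",
]

matrix, calls = pairwise_matrix(strings, hamming)
for row in matrix:
    print(row)
print(calls)  # 10 calls instead of the 25 made by the product-based version
```

For n strings this makes n * (n - 1) / 2 distance calls rather than n * n, which roughly halves the work for the 1000-string case.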