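I am using string edit distance (Levenshtein distance) to compare scan paths from an eye-tracking experiment. (Right now I am using the stringdist package in R.)
Basically, the letters of the strings refer to (gaze) positions in a 6x4 matrix. The matrix is configured as follows: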
     [,1] [,2] [,3] [,4]
[1,] 'a'  'g'  'm'  's'
[2,] 'b'  'h'  'n'  't'
[3,] 'c'  'i'  'o'  'u'
[4,] 'd'  'j'  'p'  'v'
[5,] 'e'  'k'  'q'  'w'
[6,] 'f'  'l'  'r'  'x'
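If I use the basic Levenshtein distance to compare strings, a comparison of 'a' and 'g' within a string gives the same estimate as a comparison of 'a' and 'x'.
E.g.:
'abc' compared to 'agc' -> 1
'abc' compared to 'axc' -> 1
This means that the two pairs of strings come out as equally (dis)similar.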
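I would like to be able to weight the string comparison in a way that takes adjacency in the matrix into account. E.g. the distance between 'a' and 'x' should be weighted as larger than the distance between 'a' and 'g'.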
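One way could be to count the "walk" (horizontal and vertical steps) from one letter to the other in the matrix and divide it by the maximum "walk" distance (i.e. from 'a' to 'x'). E.g. the "walk" distance from 'a' to 'g' would be 1 and from 'a' to 'x' it would be 8, resulting in weights of 1/8 and 1 respectively.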
Is there a way to implement this (in either R or Python)?
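The "walk" weighting described in the question can be sketched in a few lines of Python. This is only an illustration: the function names (grid_position, walk, walk_weight) are made up for this sketch, and it assumes the letters 'a' to 'x' fill the 6x4 grid column by column exactly as shown above.

```python
# Hypothetical helpers; the grid layout is the one from the question:
# letters run down each column (a-f, g-l, m-r, s-x), 6 rows by 4 columns.
GRID_ROWS = 6
GRID_COLS = 4


def grid_position(letter):
    """Return the zero-based (row, column) of a letter in the 6x4 grid."""
    index = ord(letter) - ord("a")
    if not 0 <= index < GRID_ROWS * GRID_COLS:
        raise ValueError("letter outside the grid: %r" % letter)
    return index % GRID_ROWS, index // GRID_ROWS


def walk(a, b):
    """Number of horizontal plus vertical steps between two letters."""
    (row_a, col_a), (row_b, col_b) = grid_position(a), grid_position(b)
    return abs(row_a - row_b) + abs(col_a - col_b)


# Longest possible walk: from one corner ('a') to the opposite one ('x').
MAX_WALK = walk("a", "x")


def walk_weight(a, b):
    """Walk distance scaled to the range 0-1."""
    return walk(a, b) / MAX_WALK


print(walk("a", "g"), walk_weight("a", "g"))  # 1 0.125
print(walk("a", "x"), walk_weight("a", "x"))  # 8 1.0
```

Identical letters get a weight of 0, so a weight like this is best used for substitutions only; insertions and deletions involve a single letter and need a cost of their own.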
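Answer 0 (score: 4)
You need a version of the Wagner-Fischer algorithm that uses non-unit costs in its inner loop. I.e. where the usual algorithm has +1, use +del_cost(a[i]) etc., and define del_cost, ins_cost and sub_cost as functions taking one or two symbols (probably just table lookups).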
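A minimal sketch of that idea in Python follows. It is not taken from any library: weighted_edit_distance and grid_sub_cost are hypothetical names, insertions and deletions are given a flat cost of 1 as an assumption, and the substitution cost is the grid "walk" from the question divided by the maximum walk of 8.

```python
def weighted_edit_distance(a, b, ins_cost, del_cost, sub_cost):
    """Wagner-Fischer dynamic programme with caller-supplied cost functions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):  # first column: delete every symbol of a
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, cols):  # first row: insert every symbol of b
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost(a[i - 1]),
                d[i][j - 1] + ins_cost(b[j - 1]),
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
    return d[-1][-1]


def grid_sub_cost(x, y, rows=6, max_walk=8):
    """Walk distance on the question's 6x4 grid, scaled to the range 0-1.

    Assumes the letters 'a'..'x' fill the grid column by column.
    """
    ix, iy = ord(x) - ord("a"), ord(y) - ord("a")
    steps = abs(ix % rows - iy % rows) + abs(ix // rows - iy // rows)
    return steps / max_walk


def unit_cost(symbol):
    """Flat cost for inserting or deleting any symbol (an assumption)."""
    return 1.0


print(weighted_edit_distance("abc", "agc", unit_cost, unit_cost, grid_sub_cost))  # 0.25
print(weighted_edit_distance("abc", "axc", unit_cost, unit_cost, grid_sub_cost))  # 0.875
```

With these costs the two example pairs from the question are no longer tied: replacing 'b' with the nearby 'g' costs 2/8, while replacing it with the distant 'x' costs 7/8. Passing the costs in as functions keeps the dynamic programme itself unchanged when the weighting scheme changes.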
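Answer 1 (score: 2)
In case anybody has the same "problem", here is my solution. I made an add-on to the Python implementation of the Wagner-Fischer algorithm written by Kyle Gorman.
The add-on is the weight function and its use in the _dist function.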
#!/usr/bin/env python
# wagnerfischer.py: Dynamic programming Levenshtein distance function
# Kyle Gorman <gormanky@ohsu.edu>
#
# Based on:
#
# Robert A. Wagner and Michael J. Fischer (1974). The string-to-string
# correction problem. Journal of the ACM 21(1):168-173.
#
# The thresholding function was inspired by BSD-licensed code from
# Babushka, a Ruby tool by Ben Hoskings and others.
#
# Unlike many other Levenshtein distance functions out there, this works
# on arbitrary comparable Python objects, not just strings.
try:  # use numpy arrays if possible...
    from numpy import zeros

    def _zeros(*shape):
        """ like this syntax better...a la MATLAB """
        return zeros(shape)

except ImportError:  # otherwise do this cute solution

    def _zeros(*shape):
        if len(shape) == 0:
            return 0
        car = shape[0]
        cdr = shape[1:]
        return [_zeros(*cdr) for i in range(car)]
def weight(A, B, weights):
    """Walk distance between AOI letters A and B, scaled to the range 0-1."""
    if weights:
        from numpy import array, where
        # cost_weight defines the matrix structure of the AOI placement
        cost_weight = array([["a", "b", "c", "d", "e", "f"],
                             ["g", "h", "i", "j", "k", "l"],
                             ["m", "n", "o", "p", "q", "r"],
                             ["s", "t", "u", "v", "w", "x"]])
        # max_walk is the maximum possible distance between letters in
        # the cost_weight matrix
        max_walk = 8.00
        indexA = where(cost_weight == A)
        indexB = where(cost_weight == B)
        walk = (abs(indexA[0][0] - indexB[0][0]) +
                abs(indexA[1][0] - indexB[1][0]))
        return walk / max_walk
    else:
        return 1
def _dist(A, B, insertion, deletion, substitution, weights=True):
    # Every operation cost is scaled by the walk weight of the letter pair
    # being compared; the border cells compare against B[0] and A[0].
    D = _zeros(len(A) + 1, len(B) + 1)
    for i in range(len(A)):
        D[i + 1][0] = D[i][0] + deletion * weight(A[i], B[0], weights)
    for j in range(len(B)):
        D[0][j + 1] = D[0][j] + insertion * weight(A[0], B[j], weights)
    for i in range(len(A)):  # fill out middle of matrix
        for j in range(len(B)):
            if A[i] == B[j]:
                D[i + 1][j + 1] = D[i][j]  # aka, it's free.
            else:
                w = weight(A[i], B[j], weights)
                D[i + 1][j + 1] = min(D[i + 1][j] + insertion * w,
                                      D[i][j + 1] + deletion * w,
                                      D[i][j] + substitution * w)
    return D
def _dist_thresh(A, B, thresh, insertion, deletion, substitution):
    # Note: this thresholded variant does not apply the walk weights.
    D = _zeros(len(A) + 1, len(B) + 1)
    for i in range(len(A)):
        D[i + 1][0] = D[i][0] + deletion
    for j in range(len(B)):
        D[0][j + 1] = D[0][j] + insertion
    for i in range(len(A)):  # fill out middle of matrix
        for j in range(len(B)):
            if A[i] == B[j]:
                D[i + 1][j + 1] = D[i][j]  # aka, it's free.
            else:
                D[i + 1][j + 1] = min(D[i + 1][j] + insertion,
                                      D[i][j + 1] + deletion,
                                      D[i][j] + substitution)
        # give up as soon as a whole row is at or above the threshold
        if min(D[i + 1]) >= thresh:
            return
    return D
def _levenshtein(A, B, insertion, deletion, substitution):
    return _dist(A, B, insertion, deletion, substitution)[len(A)][len(B)]
def _levenshtein_ids(A, B, insertion, deletion, substitution):
    """
    Perform a backtrace to determine the optimal path. This was hard.
    """
    D = _dist(A, B, insertion, deletion, substitution)
    i = len(A)
    j = len(B)
    ins_c = 0
    del_c = 0
    sub_c = 0
    while True:
        if i > 0:
            if j > 0:
                if D[i - 1][j] <= D[i][j - 1]:  # if ins < del
                    if D[i - 1][j] < D[i - 1][j - 1]:  # if ins < m/s
                        ins_c += 1
                    else:
                        if D[i][j] != D[i - 1][j - 1]:  # if not m
                            sub_c += 1
                        j -= 1
                    i -= 1
                else:
                    if D[i][j - 1] <= D[i - 1][j - 1]:  # if del < m/s
                        del_c += 1
                    else:
                        if D[i][j] != D[i - 1][j - 1]:  # if not m
                            sub_c += 1
                        i -= 1
                    j -= 1
            else:  # only insert
                ins_c += 1
                i -= 1
        elif j > 0:  # only delete
            del_c += 1
            j -= 1
        else:
            return (ins_c, del_c, sub_c)
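def _levenshtein_thresh(A, B, thresh, insertion, deletion, substitution):
    D = _dist_thresh(A, B, thresh, insertion, deletion, substitution)
    # "is not None" also works when D is a numpy array, where "!= None"
    # would compare element by element
    if D is not None:
        return D[len(A)][len(B)]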
def levenshtein(A, B, thresh=None, insertion=1, deletion=1, substitution=1):
    """
    Compute levenshtein distance between iterables A and B
    """
    # basic checks
    if len(A) == len(B) and A == B:
        return 0
    if len(B) > len(A):
        (A, B) = (B, A)
    if len(A) == 0:
        return len(B)
    if thresh:
        if len(A) - len(B) > thresh:
            return
        return _levenshtein_thresh(A, B, thresh, insertion, deletion,
                                   substitution)
    if len(B) == 0:
        # the weighted _dist indexes B[0], so an empty B has to be handled
        # here: every element of A is deleted at the unweighted cost
        return len(A) * deletion
    return _levenshtein(A, B, insertion, deletion, substitution)
def levenshtein_ids(A, B, insertion=1, deletion=1, substitution=1):
    """
    Compute number of insertions, deletions, and substitutions for an
    optimal alignment.
    There may be more than one, in which case we disfavor substitution.
    """
    # basic checks
    if len(A) == len(B) and A == B:
        return (0, 0, 0)
    if len(B) > len(A):
        (A, B) = (B, A)
    if len(B) == 0:
        # return a tuple like every other path; the counts follow the
        # "only insert" branch of the backtrace in _levenshtein_ids
        return (len(A), 0, 0)
    else:
        return _levenshtein_ids(A, B, insertion, deletion, substitution)
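Answer 2 (score: 0)
Check out this library: https://github.com/infoscout/weighted-levenshtein (disclaimer: I am the author). It supports weighted Levenshtein distance, weighted optimal string alignment, and weighted Damerau-Levenshtein distance. It is written in Cython for maximum performance and can be installed easily with pip install weighted-levenshtein. Feedback and pull requests are welcome.
Sample usage: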
import numpy as np
from weighted_levenshtein import lev
insert_costs = np.ones(128, dtype=np.float64) # make an array of all 1's of size 128, the number of ASCII characters
insert_costs[ord('D')] = 1.5 # make inserting the character 'D' have cost 1.5 (instead of 1)
# you can just specify the insertion costs
# delete_costs and substitute_costs default to 1 for all characters if unspecified
print(lev('BANANAS', 'BANDANAS', insert_costs=insert_costs))  # prints '1.5'
Answer 3 (score: 0)
Another option that supports weights (Python 3.5), and with which I am not affiliated, is https://github.com/luozhouyang/python-string-similarity