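Levenshtein distance with weight/penalty for adjacency

Asked: 2014-05-07 10:56:05

Tags: python r levenshtein-distance edit-distance eye-tracking

I am using string edit distance (Levenshtein distance) to compare scan paths from an eye-tracking experiment. (Right now I am using the stringdist package in R.)

Basically, the letters of the strings refer to (gaze) positions in a 6x4 matrix. The matrix is configured as follows: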

     [,1] [,2] [,3] [,4]
[1,]  'a'  'g'  'm'  's' 
[2,]  'b'  'h'  'n'  't'
[3,]  'c'  'i'  'o'  'u'
[4,]  'd'  'j'  'p'  'v'
[5,]  'e'  'k'  'q'  'w'
[6,]  'f'  'l'  'r'  'x'

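If I use the basic Levenshtein distance to compare strings, substituting g for a letter in a string gives the same estimate as substituting x for it, no matter how far apart the positions are in the matrix.

E.g.: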

'abc' compared to 'agc' -> 1
'abc' compared to 'axc' -> 1

This means the strings are rated as equally (dis)similar.
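The two comparisons above can be reproduced with a minimal unit-cost Levenshtein sketch in Python. The function name `levenshtein` here is just an illustrative choice and is not taken from stringdist:

```python
def levenshtein(a, b):
    """Unit-cost Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("abc", "agc"))  # 1
print(levenshtein("abc", "axc"))  # 1
```

Every substitution costs 1 here, so the position of the substituted letter in the matrix has no influence on the result.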

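I would like to be able to weight the string comparison in a way that incorporates adjacency in the matrix. E.g. the distance between a and x should be weighted as larger than the distance between a and g.

One way could be to calculate the "walk" (horizontal and vertical steps) from one letter to the other in the matrix and divide by the maximum "walk" distance (i.e. from a to x). E.g. the "walk" distance from a to g would be 1 and from a to x it would be 8, resulting in weights of 1/8 and 1 respectively.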
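As a rough sketch of that "walk" weighting in Python: the names `POSITION`, `walk` and `walk_weight` are made up for illustration, and the grid is assumed to be filled column by column exactly as in the matrix shown above.

```python
import string

N_ROWS, N_COLS = 6, 4  # grid from the question, filled column by column
LETTERS = string.ascii_lowercase[:N_ROWS * N_COLS]  # 'a' .. 'x'

# Map each letter to (row, col): 'a' -> (0, 0), 'g' -> (0, 1), 'x' -> (5, 3).
POSITION = {ch: (k % N_ROWS, k // N_ROWS) for k, ch in enumerate(LETTERS)}

MAX_WALK = (N_ROWS - 1) + (N_COLS - 1)  # 8 steps, corner to corner


def walk(x, y):
    """Number of horizontal plus vertical steps between two letters."""
    (r1, c1), (r2, c2) = POSITION[x], POSITION[y]
    return abs(r1 - r2) + abs(c1 - c2)


def walk_weight(x, y):
    """Walk distance scaled into the range [0, 1]."""
    return walk(x, y) / float(MAX_WALK)


print(walk("a", "g"), walk_weight("a", "g"))  # 1 0.125
print(walk("a", "x"), walk_weight("a", "x"))  # 8 1.0
```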

Is there a way to implement this (in either R or Python)?

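4 answers:

Answer 0: (score: 4)

You need a version of the Wagner-Fischer algorithm that uses non-unit costs in its inner loop. I.e. where the usual algorithm has +1, use +del_cost(a[i]), etc., and define del_cost, ins_cost and sub_cost as functions taking one or two symbols (probably just table lookups).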
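A minimal sketch of what this could look like in Python. The names del_cost, ins_cost and sub_cost follow the answer; `weighted_levenshtein`, `unit_cost` and `grid_sub_cost` are hypothetical helpers, and the grid cost assumes the column-by-column 6x4 layout from the question:

```python
def weighted_levenshtein(a, b, ins_cost, del_cost, sub_cost):
    """Wagner-Fischer table with caller-supplied, non-unit edit costs."""
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):   # first column: delete all of a[:i]
        D[i][0] = D[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, m + 1):   # first row: insert all of b[:j]
        D[0][j] = D[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(a[i - 1]),
                          D[i][j - 1] + ins_cost(b[j - 1]),
                          D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return D[n][m]


def unit_cost(symbol):
    """Insertions and deletions keep the usual cost of 1."""
    return 1.0


def grid_sub_cost(x, y):
    """Assumed substitution cost: walk distance on the 6x4 grid, divided by 8."""
    kx, ky = ord(x) - ord("a"), ord(y) - ord("a")
    return (abs(kx % 6 - ky % 6) + abs(kx // 6 - ky // 6)) / 8.0


print(weighted_levenshtein("abc", "agc", unit_cost, unit_cost, grid_sub_cost))  # 0.25
print(weighted_levenshtein("abc", "axc", unit_cost, unit_cost, grid_sub_cost))  # 0.875
```

In these two strings it is b that gets substituted, and b is 2 steps from g and 7 steps from x, hence 2/8 and 7/8. Matching symbols have walk distance 0, so no separate equality check is needed.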

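Answer 1: (score: 2)

In case anyone has the same "problem", here is my solution. I made an add-on to the Python implementation of the Wagner-Fischer algorithm written by Kyle Gorman.

The add-on is the weight function and its implementation in the _dist function.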

#!/usr/bin/env python
# wagnerfischer.py: Dynamic programming Levenshtein distance function
# Kyle Gorman <gormanky@ohsu.edu>
# 
# Based on:
# 
# Robert A. Wagner and Michael J. Fischer (1974). The string-to-string 
# correction problem. Journal of the ACM 21(1):168-173.
#
# The thresholding function was inspired by BSD-licensed code from 
# Babushka, a Ruby tool by Ben Hoskings and others.
# 
# Unlike many other Levenshtein distance functions out there, this works 
# on arbitrary comparable Python objects, not just strings.


try: # use numpy arrays if possible...
    from numpy import zeros
    def _zeros(*shape):
        """ like this syntax better...a la MATLAB """
        return zeros(shape)

except ImportError: # otherwise do this cute solution
    def _zeros(*shape):
        if len(shape) == 0:
            return 0
        car = shape[0]
        cdr = shape[1:]
        return [_zeros(*cdr) for i in range(car)]

def weight(A, B, weights):
    """Walk-distance weight between AOI letters A and B (1 if weights is off)."""
    if weights:
        from numpy import array, where
        # cost_weight defines the matrix structure of the AOI placement
        # (a plain array, so that where() yields scalar indices)
        cost_weight = array([["a","b","c","d","e","f"],["g","h","i","j","k","l"],
                             ["m","n","o","p","q","r"],["s","t","u","v","w","x"]])

        max_walk = 8.00 # defined as the maximum possible distance between letters in
                        # the cost_weight matrix

        indexA = where(cost_weight == A)
        indexB = where(cost_weight == B)

        walk = abs(indexA[0][0]-indexB[0][0])+abs(indexA[1][0]-indexB[1][0])

        w = walk/max_walk

        return w
    else:
        return 1

def _dist(A, B, insertion, deletion, substitution, weights=True):
    D = _zeros(len(A) + 1, len(B) + 1)
    for i in range(len(A)):
        D[i + 1][0] = D[i][0] + deletion * weight(A[i],B[0], weights)
    for j in range(len(B)):
        D[0][j + 1] = D[0][j] + insertion * weight(A[0],B[j], weights)
    for i in range(len(A)): # fill out middle of matrix
        for j in range(len(B)):
            if A[i] == B[j]:
                D[i + 1][j + 1] = D[i][j] # aka, it's free. 
            else:
                D[i + 1][j + 1] = min(D[i + 1][j] + insertion * weight(A[i],B[j], weights),
                                      D[i][j + 1] + deletion * weight(A[i],B[j], weights),
                                      D[i][j]     + substitution * weight(A[i],B[j], weights))
    return D

def _dist_thresh(A, B, thresh, insertion, deletion, substitution):
    D = _zeros(len(A) + 1, len(B) + 1)
    for i in range(len(A)):
        D[i + 1][0] = D[i][0] + deletion
    for j in range(len(B)):
        D[0][j + 1] = D[0][j] + insertion
    for i in range(len(A)): # fill out middle of matrix
        for j in range(len(B)):
            if A[i] == B[j]:
                D[i + 1][j + 1] = D[i][j] # aka, it's free. 
            else:
                D[i + 1][j + 1] = min(D[i + 1][j] + insertion,
                                      D[i][j + 1] + deletion,
                                      D[i][j]     + substitution)
        if min(D[i + 1]) >= thresh:
            return
    return D

def _levenshtein(A, B, insertion, deletion, substitution):
    return _dist(A, B, insertion, deletion, substitution)[len(A)][len(B)]

def _levenshtein_ids(A, B, insertion, deletion, substitution):
    """
    Perform a backtrace to determine the optimal path. This was hard.
    """
    D = _dist(A, B, insertion, deletion, substitution)
    i = len(A) 
    j = len(B)
    ins_c = 0
    del_c = 0
    sub_c = 0
    while True:
        if i > 0:
            if j > 0:
                if D[i - 1][j] <= D[i][j - 1]: # if ins < del
                    if D[i - 1][j] < D[i - 1][j - 1]: # if ins < m/s
                        ins_c += 1
                    else:
                        if D[i][j] != D[i - 1][j - 1]: # if not m
                            sub_c += 1
                        j -= 1
                    i -= 1
                else:
                    if D[i][j - 1] <= D[i - 1][j - 1]: # if del < m/s
                        del_c += 1
                    else:
                        if D[i][j] != D[i - 1][j - 1]: # if not m
                            sub_c += 1
                        i -= 1
                    j -= 1
            else: # only insert
                ins_c += 1
                i -= 1
        elif j > 0: # only delete
            del_c += 1
            j -= 1
        else: 
            return (ins_c, del_c, sub_c)


def _levenshtein_thresh(A, B, thresh, insertion, deletion, substitution):
    D = _dist_thresh(A, B, thresh, insertion, deletion, substitution)
    if D is not None: # identity test; != on a numpy array compares elementwise
        return D[len(A)][len(B)]

def levenshtein(A, B, thresh=None, insertion=1, deletion=1, substitution=1):
    """
    Compute levenshtein distance between iterables A and B
    """
    # basic checks
    if len(A) == len(B) and A == B:
        return 0       
    if len(B) > len(A):
        (A, B) = (B, A)
    if len(A) == 0:
        return len(B)
    if thresh:
        if len(A) - len(B) > thresh:
            return
        return _levenshtein_thresh(A, B, thresh, insertion, deletion,
                                                            substitution)
    else: 
        return _levenshtein(A, B, insertion, deletion, substitution)

def levenshtein_ids(A, B, insertion=1, deletion=1, substitution=1):
    """
    Compute number of insertions deletions, and substitutions for an 
    optimal alignment.
    There may be more than one, in which case we disfavor substitution.
    """
    # basic checks
    if len(A) == len(B) and A == B:
        return (0, 0, 0)
    if len(B) > len(A):
        (A, B) = (B, A)
    if len(A) == 0:
        return len(B)
    else: 
        return _levenshtein_ids(A, B, insertion, deletion, substitution)

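Answer 2: (score: 0)

Check out this library: https://github.com/infoscout/weighted-levenshtein (disclaimer: I am the author). It supports weighted Levenshtein distance, weighted optimal string alignment, and weighted Damerau-Levenshtein distance. It is written in Cython for best performance and can be installed easily with pip install weighted-levenshtein. Feedback and pull requests are welcome.

Sample usage: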

import numpy as np
from weighted_levenshtein import lev


insert_costs = np.ones(128, dtype=np.float64)  # make an array of all 1's of size 128, the number of ASCII characters
insert_costs[ord('D')] = 1.5  # make inserting the character 'D' have cost 1.5 (instead of 1)

# you can just specify the insertion costs
# delete_costs and substitute_costs default to 1 for all characters if unspecified
print(lev('BANANAS', 'BANDANAS', insert_costs=insert_costs))  # prints '1.5'

Answer 3: (score: 0)

Another option for weights (Python 3.5), which I am not affiliated with, is https://github.com/luozhouyang/python-string-similarity
