Damerau-Levenshtein距离的有效实施

时间:2014-09-10 13:43:25

标签: performance clojure levenshtein-distance

我正在尝试实现真正高效的Clojure函数来计算Damerau-Levenshtein distance。我已决定使用this algorithm(附加源应该是C ++)来计算Levenshtein距离并添加一些行以使其适用于DLD。

以下是我在Common Lisp中创建的内容(我希望它可以提供帮助):

(defun damerau-levenshtein (x y)
  (declare (type string x y)
           #.*std-opts*)
  (let* ((x-len (length x))
         (y-len (length y))
         (v0 (apply #'vector (mapa-b #'identity 0 y-len)))
         (v1 (make-array (1+ y-len) :element-type 'integer))
         (v* (make-array (1+ y-len) :element-type 'integer)))
    (do ((i 0 (1+ i)))
        ((= i x-len) (aref v0 y-len))
      (setf (aref v1 0) (1+ i))
      (do ((j 0 (1+ j)))
          ((= j y-len))
        (let* ((x-i (char x i))
               (y-j (char y j))
               (cost (if (char-equal x-i y-j) 0 1)))
          (setf (aref v1 (1+ j)) (min (1+ (aref v1 j))
                                      (1+ (aref v0 (1+ j)))
                                      (+  (aref v0 j) cost)))
          (when (and (plusp i) (plusp j))
            (let ((x-i-1 (char x (1- i)))
                  (y-j-1 (char y (1- j)))
                  (val (+ (aref v* (1- j)) cost)))
              (when (and (char-equal x-i y-j-1)
                         (char-equal x-i-1 y-j)
                         (< val (aref v1 (1+ j))))
                (setf (aref v1 (1+ j)) val))))))
      (rotatef v* v0 v1))))

现在,我担心我无法将其翻译成真正高效且惯用的Clojure代码(功能样式?)。我真的很感激任何建议,我认为它对许多未来的读者也很有用。

P.S。我找到了this implementation,但我怀疑它是否有效并且它使用了一些过时的contrib函数(deep-merge-withbool-to-binary):

(defn damerau-levenshtein-distance
  [a b]
  (let [m (count a)
        n (count b)
        init (apply deep-merge-with (fn [a b] b)
                    (concat
                     ;;deletion
                     (for [i (range 0 (+ 1 m))]
                       {i {0 i}})
                     ;;insertion
                     (for [j (range 0 (+ 1 n))]
                       {0 {j j}})))
        table (reduce
               (fn [d [i j]]
                 (deep-merge-with
                  (fn [a b] b)
                  d
                  (let [cost (bool-to-binary (not (= (nth a (- i 1))
                                          (nth b (- j 1)))))
                        x
                          (min
                           (+ ((d (- i 1))
                               j) 1) ;;deletion
                           (+ ((d i)
                               (- j 1)) 1) ;;insertion
                           (+ ((d (- i 1))
                               (- j 1)) cost)) ;;substitution))
                        val (if (and (> i 1)
                               (> j 1)
                               (= (nth a (- i 1))
                                  (nth b (- j 2)))
                               (= (nth a (- i 2))
                                  (nth b (- j 1))))
                        (min x (+ ((d (- i 2))
                                   (- j 2)) ;;transposition
                                  cost))
                        x)]
                    {i {j val}})))
               init
               (for [j (range 1 (+ 1 n))
                     i (range 1 (+ 1 m))] [i j]))]
    ((table m) n)))

2 个答案:

答案 0 :(得分:2)

我最近不得不在clojure中编写一个有效的levenshtein距离函数来计算地面实况文本和ocr引擎结果之间的编辑。 递归实现的性能不足以快速计算两个整页之间的levenshtein距离,因此我的实现使用动态编程。 它不是下载到java 2d-arrays而是使用core.matrix来处理矩阵内容。 添加damerau-levenshtein的转置内容应该不难。

(defn lev [str1 str2]
  (let [mat (new-matrix :ndarray (inc (count str1)) (inc (count str2)))
        len1 (count str1) len2 (count str2)]
   (mset! mat 0 0 0)
   (dotimes [i lein1]
     (mset! mat (inc i) 0 (inc i)))
   (dotimes [j len2]
     (mset! mat 0 (inc j) (inc j)))
   (dotimes [dj len2]
     (dotimes [di len1]
       (let [j (inc dj) i (inc di)]
         (mset! mat i j
              (cond
                (= (.charAt ^String str1 di) (.charAt ^String str2 dj))
                (mget mat di dj)
                :else
                (min (inc (mget mat di j)) (inc (mget mat i dj))
                    (inc (mget mat di dj))))))))
   (mget mat len1 len2))))

希望这有帮助

答案 1 :(得分:1)

好的,这应该可以解决问题(基于KIMA's answer):

(defn da-lev [str1 str2]
  (let [l1 (count str1)
        l2 (count str2)
        mx (new-matrix :ndarray (inc l1) (inc l2))]
   (mset! mx 0 0 0)
   (dotimes [i l1]
     (mset! mx (inc i) 0 (inc i)))
   (dotimes [j l2]
     (mset! mx 0 (inc j) (inc j)))
   (dotimes [i l1]
     (dotimes [j l2]
       (let [i+ (inc i) j+ (inc j)
             i- (dec i) j- (dec j)
             cost (if (= (.charAt str1 i)
                         (.charAt str2 j))
                    0 1)]
         (mset! mx i+ j+
                (min (inc (mget mx i j+))
                     (inc (mget mx i+ j))
                     (+ (mget mx i j) cost)))
         (if (and (pos? i) (pos? j)
                  (= (.charAt str1 i)
                     (.charAt str2 j-))
                  (= (.charAt str1 i-)
                     (.charAt str2 j)))
           (mset! mx i+ j+
                  (min (mget mx i+ j+)
                       (+ (mget mx i- j-) cost)))))))
   (mget mx l1 l2)))

请注意,您需要core.matrix库,这不是标准的(尽管它的名称)。可以用Leiningen这样安装它:

[net.mikera/core.matrix "0.29.1"]

该库位于名称空间clojure.core.matrix中。要“按原样”使用此解决方案,您应该将命名空间中的符号添加到命名空间中。