Question

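Background: I have a list of 13,000 last-name records, some of which are duplicates, and I want to find the similar ones so that I can deduplicate them manually.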

For an array like this:

["jeff","Jeff","mandy","king","queen"]

I am looking for an efficient way to get:

[["jeff","Jeff"]]

Explanation: ["jeff","Jeff"] is returned because the Levenshtein distance between the two is 1 (the threshold could be variable, e.g. 3).

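I want to find similarity by Levenshtein distance, not just by case-insensitive matching.

I have already found one of the fastest Levenshtein implementations, but it still takes 35 minutes to produce the result for the 13,000-item list.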

Answer 1

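Your problem is not the speed of the Levenshtein distance implementation. Your problem is that you have to compare every word with every other word. That means you make 13000² comparisons (and compute the Levenshtein distance each time).

So my approach would be to try to reduce the number of comparisons.

Here are some ideas:

Words are only similar if their lengths differ by less than 20% (just my estimate)
→ we can group by length and compare words only with other words of length ±20%
Words are only similar if they share a lot of letters
→ we can build a list of, e.g., 3-grams (all lowercase) that refers to the words they are part of.
→ only compare a word (e.g. with the Levenshtein distance) to other words that have several 3-grams in common with it.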
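
The two ideas above could be combined roughly as follows. This is only a sketch: `trigrams`, `candidatePairs`, and the thresholds `minShared` and `maxLenRatio` are names and values made up for illustration, not tuned settings. It returns the index pairs that would still need a real Levenshtein check.

```javascript
// Split a word into its set of lowercase 3-grams (substrings of length 3).
function trigrams(word) {
  const w = word.toLowerCase();
  const grams = new Set();
  for (let i = 0; i + 3 <= w.length; i++) grams.add(w.slice(i, i + 3));
  return grams;
}

// Return index pairs [i, j] (i < j) worth passing to a Levenshtein check.
// minShared and maxLenRatio are illustrative thresholds (assumptions).
function candidatePairs(words, minShared = 1, maxLenRatio = 0.2) {
  const index = new Map(); // 3-gram -> indices of earlier words containing it
  const pairs = [];
  words.forEach((word, j) => {
    const grams = trigrams(word);
    const shared = new Map(); // earlier word index -> number of shared 3-grams
    for (const gram of grams) {
      for (const i of index.get(gram) || []) {
        shared.set(i, (shared.get(i) || 0) + 1);
      }
    }
    for (const [i, count] of shared) {
      const longer = Math.max(word.length, words[i].length);
      const diff = Math.abs(word.length - words[i].length);
      if (count >= minShared && diff <= maxLenRatio * longer) pairs.push([i, j]);
    }
    // Register this word's 3-grams so later words can find it.
    for (const gram of grams) {
      if (!index.has(gram)) index.set(gram, []);
      index.get(gram).push(j);
    }
  });
  return pairs;
}

const sample = ["jeff", "Jeff", "mandy", "king", "queen"];
console.log(candidatePairs(sample).map(([i, j]) => [sample[i], sample[j]]));
```

Note that this filter is a heuristic: words shorter than three characters have no 3-grams at all, and a single substitution in a short word (e.g. "jeff" vs. "joff") can leave no 3-gram in common, so such pairs would be missed.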

Answer 2

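Approaches for removing similar names:

Use a phonetic representation of the words. cmudict works with Python's NLTK. You can find which names are phonetically close to each other.
Try different forms of stemming or simplification. I would try the most aggressive stemmers, such as the Porter stemmer.
Levenshtein trie. You can build a trie data structure that helps find the word with the minimum distance to the searched item; some search engines use this for full-text search. As far as I know, it has already been implemented in Java. In your case you need to search for an item and then add it to the structure at each step, making sure that the item you search for is not yet in the structure.
Manual naive approach. Find all suitable representations of every word/name, put all representations into a map, and look for representations that map to more than one word. If you have about 15 different representations per word, you only need 280K iterations to build this object (much faster than comparing each word with every other word, which takes around 80 million comparisons for 13K names, since 13,000² / 2 ≈ 84.5 million).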
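
The "manual naive approach" might look something like the sketch below. The three representation functions are hypothetical stand-ins chosen for illustration; in practice they could be phonetic codes or stems, as suggested in the first two items.

```javascript
// Hypothetical representation functions (assumptions for illustration only).
const representations = [
  (name) => name.toLowerCase(),                          // case-insensitive form
  (name) => name.toLowerCase().replace(/[aeiou]/g, ""),  // consonant skeleton
  (name) => name.toLowerCase().replace(/(.)\1+/g, "$1"), // collapse repeated letters
];

function groupByRepresentation(names) {
  const buckets = new Map(); // "<representation index>:<key>" -> Set of names
  for (const name of names) {
    representations.forEach((rep, r) => {
      const key = `${r}:${rep(name)}`;
      if (!buckets.has(key)) buckets.set(key, new Set());
      buckets.get(key).add(name);
    });
  }
  // Keep only the representations shared by more than one distinct name.
  return [...buckets.values()]
    .filter((set) => set.size > 1)
    .map((set) => [...set]);
}

console.log(groupByRepresentation(["jeff", "Jeff", "Jef", "mandy", "king", "queen"]));
```

Each name is processed once per representation, so the work grows linearly with the number of names. The same group can show up under several representations, so the result may need deduplicating before manual review.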

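-Edit-

If I had the choice, I would use Python or Java instead of JS. That is just my opinion, based on the following: I don't know all the requirements, Java and Python are commonly used for natural language processing, and the task looks more like heavy data processing than front-end work.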

Answer 3

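In your working code you only use a Levenshtein distance of 1, so I will assume that no other distances need to be found.

I will propose a solution similar to the one Jonas Wilms posted, with these differences:

No need to call an isLevenshtein function
Produces only unique pairs
Each pair is lexically ordered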

// Sample data with lots of similar names
const names = ["Adela","Adelaida","Adelaide","Adele","Adelia","AdeLina","Adeline",
               "Adell","AdellA","Adelle","Ardelia","Ardell","Ardella","Ardelle",
               "Ardis","Madeline","Odelia","ODELL","Odessa","Odette"];

const map = {};
const pairs = new Set;
for (const name of names) {
    for (const i in name+"_") { // Additional iteration to NOT delete a character
        const key = (name.slice(0, i) + name.slice(+i + 1, name.length)).toLowerCase();
        // Group words together where the removal from the same index leads to the same key
        if (!map[key]) map[key] = Array.from({length: key.length+1}, () => new Set);
        // If NO character was removed, put the word in EACH group
        for (const set of (+i < name.length ? [map[key][i]] : map[key])) {
            if (set.has(name)) continue;
            for (let similar of set) pairs.add(JSON.stringify([similar, name].sort()));
            set.add(name);
        }
    }
}
const result = [...pairs].sort().map(JSON.parse); // sort is optional
console.log(result);

I ran this on a set of 13,000 names, including at least 4,000 distinct names, and it produced 8,000 pairs in about 0.3 seconds.

Answer 4

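If we remove one character from "Jeff" at different positions, we end up with "eff", "Jff", "Jef" and "Jef". If we do the same with "jeff", we get "eff", "jff", "jef" and "jef". Now if you look closely, you will see that both strings produce "eff" as a result, which means we can create a Map from those combinations to their original versions, then generate all combinations for each string and look them up in the Map. Through the lookup you will get results that are similar, e.g. "abc" and "cab", but that do not necessarily have a Levenshtein distance of 1, so we have to check that afterwards.

Now why is that better?

Iterating over all names is O(n) (n being the number of words), creating all combinations is O(m) (m being the average number of characters in a word), and a lookup in a Map is O(1), so this runs in O(n * m), whereas your algorithm is O(n * n * m). That means that for 10,000 words mine is 10,000 times faster (or my calculation is wrong :))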

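  // A "OneToMany" Map: set() appends to an array of values per key
  class MultiMap extends Map {
    set(k, v) {
      if (super.has(k)) {
        super.get(k).push(v);
      } else super.set(k, [v]);
    }
    get(k) {
      return super.get(k) || [];
    }
  }

  // Yields every string obtained by deleting one character from word
  function* oneShorter(word) {
    for (let pos = 0; pos < word.length; pos++)
      yield word.slice(0, pos) + word.slice(pos + 1);
  }

  // isInLevenshteinRange(a, b, max) is not defined here; it is assumed to be
  // the asker's existing check that the distance between a and b is <= max.
  function findDuplicates(names) {
    const combos = new MultiMap();
    const duplicates = [];

    const check = (name, combo) => {
      const dupes = combos.get(combo);
      for (const dupe of dupes) {
        // Compare against the stored name, skipping the name itself
        // (the same pair may still be reported more than once)
        if (dupe !== name && isInLevenshteinRange(name, dupe, 1))
          duplicates.push([name, dupe]);
      }
      combos.set(combo, name);
    };

    for (const name of names) {
      check(name, name);

      for (const combo of oneShorter(name)) {
        check(name, combo);
      }
    }

    return duplicates;
  }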
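
The findDuplicates code above calls isInLevenshteinRange without defining it. The following is one possible sketch of such a helper; the signature (two strings plus a maximum distance) is an assumption inferred from the call site, not the asker's actual implementation. It uses the standard two-row dynamic-programming table and stops early once the limit can no longer be met.

```javascript
// Returns true when the Levenshtein distance between a and b is at most max.
function isInLevenshteinRange(a, b, max) {
  // The distance is at least the difference in length, so bail out early.
  if (Math.abs(a.length - b.length) > max) return false;
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    let rowMin = curr[0];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
      rowMin = Math.min(rowMin, curr[j]);
    }
    // The row minimum never decreases, so later rows cannot get back under max.
    if (rowMin > max) return false;
    prev = curr;
  }
  return prev[b.length] <= max;
}

console.log(isInLevenshteinRange("jeff", "Jeff", 1));
```

Because this comparison is case-sensitive, "jeff" and "Jeff" count as one substitution apart, which matches the distance of 1 described in the question.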

Answer 5

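I have a completely different way of approaching this problem, but I believe what I am presenting is fairly fast (though how correct or incorrect it is remains debatable). My approach is to map the strings to numeric values, sort those values once, and then walk through the list once, comparing neighboring values to each other. Like this: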

// Test strings (provided by OP) with some additions
var strs = ["Jeff","mandy","jeff","king","queen","joff", "Queen", "jff", "tim", "Timmo", "Tom", "Rob", "Bob"] 

// Function to convert a string into a numeric representation
// to aid with string similarity comparison
function atoi(str, maxLen){
  var i = 0;
  for( var j = 0; j < maxLen; j++ ){
    if( str[j] != null ){
      i += str.toLowerCase().charCodeAt(j)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    } else {
      // Normalize the string with a pad char
      // up to the maxLen (update the value, but don't actually
      // update the string...)
      i += '-'.charCodeAt(0)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    }
  }
  valMap.push({
     str,
     i 
  })
  return i;
}

Number.prototype.inRange = function(min, max){ return(this >= min && this <= max) }

var valMap = []; // Array of string-value pairs

// maxLen of all strings in the array (a plain .sort() would compare the lengths as strings)
var maxLen = strs.reduce((max, s) => Math.max(max, s.length), 0)
console.log('maxLen', maxLen)
strs.forEach((s) => atoi(s, maxLen)) // Map strings to values

var similars = [];
var subArr = []
var margin = 0.05;
valMap.sort((a,b) => a.i > b.i ? 1 : -1) // Sort the map...
valMap.forEach((entry, idx) => {  
  if( idx > 0 ){
      var closeness = Math.abs(entry.i / valMap[idx-1].i);
      if( closeness.inRange( 1 - margin, 1 + margin ) ){
        if( subArr.length == 0 ) subArr.push(valMap[idx-1].str)
        subArr.push(entry.str)
        if( idx == valMap.length - 1){
          similars.push(subArr)
        }
      } else {
        if( subArr.length > 0 ) similars.push(subArr)
        subArr = []
      }
  }
})
console.log('similars', similars)

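I treat each string as if it were a "base-64 number", where each "digit" can take on an alphanumeric value, with 'a' representing 0. Then I sort the values once. Then, whenever I encounter a value similar to the previous one (i.e., if the ratio of the two is close to 1), I infer that I have similar strings.

The other thing I do is check for the maximum string length and normalize all strings to that length when computing the "base-64 value".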

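--- EDIT: even more stress testing ---

Here is some additional testing, which pulls a large list of names and performs the processing rather quickly (about 50 ms on 20k+ names, with a lot of false positives). Regardless, this snippet should make it easier to troubleshoot: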

var valMap = []; // Array of string-value pairs

/* Extensions */
Number.prototype.inRange = function(min, max){ return(this >= min && this <= max) }

/* Methods */
// Function to convert a string into a numeric representation
// to aid with string similarity comparison
function atoi(str, maxLen){
  var i = 0;
  for( var j = 0; j < maxLen; j++ ){
    if( str[j] != null ){
      i += str.toLowerCase().charCodeAt(j)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    } else {
      // Normalize the string with a pad char
      // up to the maxLen (update the value, but don't actually
      // update the string...)
      i += '-'.charCodeAt(0)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    }
  }
  valMap.push({ str, i })
  return i;
}

function findSimilars(strs){
  // maxLen of all strings in the array (a plain .sort() would compare the lengths as strings)
  var maxLen = strs.reduce((max, s) => Math.max(max, s.length), 0)
  console.log('maxLen', maxLen)
  strs.forEach((s) => atoi(s, maxLen)) // Map strings to values

  var similars = [];
  var subArr = []
  var margin = 0.05;
  valMap.sort((a,b) => a.i > b.i ? 1 : -1) // Sort the map...
  valMap.forEach((entry, idx) => {  
    if( idx > 0 ){
        var closeness = Math.abs(entry.i / valMap[idx-1].i);
        if( closeness.inRange( 1 - margin, 1 + margin ) ){
          if( subArr.length == 0 ) subArr.push(valMap[idx-1].str)
          subArr.push(entry.str)
          if( idx == valMap.length - 1){
            similars.push(subArr)
          }
        } else {
          if( subArr.length > 0 ) similars.push(subArr)
          subArr = []
        }
    }
  })
  console.log('similars', similars)
}

// Stress test with 20k+ names 
$.get('https://raw.githubusercontent.com/dominictarr/random-name/master/names.json')
.then((resp) => {
  var strs = JSON.parse(resp);
  console.time('processing')
  findSimilars(strs)
  console.timeEnd('processing')
})
.catch((err) => { console.error('Err retrieving JSON'); })

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

(For some reason, when I run this in JSFiddle it runs in about 50 ms, but in the Stack Overflow snippet it takes closer to 1000 ms.)

How to efficiently find similar strings among unique strings in JavaScript?
