Check whether a string contains part of another string and return a percentage

Date: 2019-08-05 10:49:00

Tags: javascript

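This is code for checking the similarity percentage of two strings (str1, str2). The code works fine and is fully accurate for whole strings. It logs a number between 0 and 1 based on the similarity between the two strings (it compares them character by character).

So, if we have the following strings: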

  var str1 = "I was sent to earth to protect you"; // user input
  var str2 = "I was sent to earth to protect you"; // reference 

the similarity result will be 1.

Now, what if we want to compare a small part of a sentence with the reference string?

So, if we have these:

  var str1 = "I was sent to earth"; // user input
  var str2 = "I was sent to earth to protect you"; // reference 

or these:

  var str1 = "I was sent to earth"; // user input
  var str2 = "to protect you I was sent to earth"; // reference 

the expected similarity result should be 1.

Here is my code:

function checkSimilarity(){
  var str1 = "I was sent to earth";
  var str2 = "I was sent to earth to protect you";
  console.log(similarity(str1, str2));
}

function similarity(s1, s2) {
  // Always normalize by the length of the longer string.
  var longer = s1;
  var shorter = s2;
  if (s1.length < s2.length) {
    longer = s2;
    shorter = s1;
  }
  var longerLength = longer.length;
  if (longerLength === 0) {
    return 1.0; // two empty strings are identical
  }
  return (longerLength - editDistance(longer, shorter)) / longerLength;
}

// Case-insensitive Levenshtein distance, computed with a single row of costs.
function editDistance(s1, s2) {
  s1 = s1.toLowerCase();
  s2 = s2.toLowerCase();

  var costs = [];
  for (var i = 0; i <= s1.length; i++) {
    var lastValue = i;
    for (var j = 0; j <= s2.length; j++) {
      if (i === 0) {
        costs[j] = j; // first row: j insertions
      } else if (j > 0) {
        var newValue = costs[j - 1];
        if (s1.charAt(i - 1) !== s2.charAt(j - 1)) {
          // cheapest of substitution, insertion, deletion, plus one edit
          newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
        }
        costs[j - 1] = lastValue;
        lastValue = newValue;
      }
    }
    if (i > 0) {
      costs[s2.length] = lastValue;
    }
  }
  return costs[s2.length];
}
checkSimilarity();

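Thanks.

1 Answer:

Answer 0: (score: 0)

I suggest using the patienceDiffPlus algorithm (see https://github.com/jonTrent/PatienceDiff), which can be used to compare two similar arrays of strings. This algorithm is normally used to find the changes in an updated computer program, but in your case it can be used to compare the words in sentences. Specifically, the algorithm searches for the longest common subsequence (LCS) between the arrays of words and reports the number of inserted, deleted, and possibly moved words, from which the number of matching words in the common sequence can be calculated.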
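To make the LCS idea concrete, here is a minimal sketch of a word-level longest common subsequence length using plain dynamic programming. It is only an illustration of the concept, not the way patienceDiffPlus is implemented, and the name `lcsLength` is made up for this example.

```javascript
// Plain dynamic-programming LCS length over two word arrays.
// Illustration only; this is NOT the patience diff implementation.
function lcsLength(a, b) {
  // While processing word i of `a`, prev[j] is the LCS length of
  // the first i - 1 words of `a` and the first j words of `b`.
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = [0];
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

const inputWords = "I was sent to earth".split(" ");
const referenceWords = "I was sent to earth to protect you".split(" ");
console.log(lcsLength(inputWords, referenceWords)); // 5
```

All five input words appear in the reference in the same order, so the common subsequence has length 5.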

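As for calculating a degree of equality: judging by your examples, extra words in the reference should apparently not count against the score. However, not knowing the full range of reference sentences that will be compared with the user input, I suggest a calculation along the following lines:

Similarity = (Result.lines.length - Result.lineCountMoved - Result.lineCountDeleted - Result.lineCountInserted) / (Result.lines.length - Result.lineCountMoved)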
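The formula can be wrapped in a small helper, sketched below. The function name `similarityFromDiff` is hypothetical, and it assumes only that the result object carries the `lines` array and the three counters shown in this answer.

```javascript
// Hypothetical helper: turns a patienceDiffPlus-style result into a 0..1 score.
// Assumes `result` has `lines` (array), `lineCountMoved`,
// `lineCountDeleted` and `lineCountInserted`, as in the output shown in this answer.
function similarityFromDiff(result) {
  const total = result.lines.length - result.lineCountMoved;
  if (total <= 0) {
    // Assumption: nothing left to compare is treated as identical,
    // mirroring the empty-string case in the question's code.
    return 1;
  }
  const unchanged = total - result.lineCountDeleted - result.lineCountInserted;
  return unchanged / total;
}

// Stand-in result with the same shape as the library output.
const sample = {
  lines: new Array(8),
  lineCountDeleted: 0,
  lineCountInserted: 3,
  lineCountMoved: 0,
};
console.log(similarityFromDiff(sample)); // 0.625
```

With the PatienceDiff script loaded, the argument would be the object returned by `patienceDiffPlus(str1.split(" "), str2.split(" "))`.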

Using the first example...

var str1 = "I was sent to earth"; // user input
var str2 = "I was sent to earth to protect you"; // reference
var compare12 = patienceDiffPlus(str1.split(" "), str2.split(" "));
console.log(compare12);
// {lines: Array(8), lineCountDeleted: 0, lineCountInserted: 3, lineCountMoved: 0}

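...indicates that str2 has 3 additional words inserted, and that 5 words are equal and in the same order. The similarity is (8 - 0 - 0 - 3) / (8 - 0), or 0.625. Now, using the second example...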

var str3 = "I was sent to earth"; // user input
var str4 = "to protect you I was sent to earth"; // reference
var compare34 = patienceDiffPlus(str3.split(" "), str4.split(" "));
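console.log(compare34);
// {lines: Array(8), lineCountDeleted: 0, lineCountInserted: 3, lineCountMoved: 0}

...again indicates that str4 has 3 additional words inserted, and that 5 words are equal and in the same order. As before, the similarity is 0.625. Now for a more complex example...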

var str5 = "I was sent to the earth to protect you";  // user input
var str6 = "to protect you I was sent to planet earth"; // reference
var compare56 = patienceDiffPlus(str5.split(" "), str6.split(" "));
console.log(compare56);
// {lines: Array(13), lineCountDeleted: 1, lineCountInserted: 1, lineCountMoved: 3}
//   lineCountDeleted: 1
//   lineCountInserted: 1
//   lineCountMoved: 3
//   lines: Array(13)
//     0: {line: "to", aIndex: 10, bIndex: 0, moved: true}
//     1: {line: "protect", aIndex: 11, bIndex: 1, moved: true}
//     2: {line: "you", aIndex: 12, bIndex: 2, moved: true}
//     3: {line: "I", aIndex: 0, bIndex: 3}
//     4: {line: "was", aIndex: 1, bIndex: 4}
//     5: {line: "sent", aIndex: 2, bIndex: 5}
//     6: {line: "to", aIndex: 3, bIndex: 6}
//     7: {line: "the", aIndex: 4, bIndex: -1}
//     8: {line: "planet", aIndex: -1, bIndex: 7}
//     9: {line: "earth", aIndex: 5, bIndex: 8}
//     10: {line: "to", aIndex: 6, bIndex: -1, moved: true}
//     11: {line: "protect", aIndex: 7, bIndex: -1, moved: true}
//     12: {line: "you", aIndex: 8, bIndex: -1, moved: true}
//     length: 13

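...indicates that str5 has one word deleted or missing relative to str6 ("the"), that str6 has one word inserted ("planet"), and that 3 words were possibly moved ("to", "protect", and "you"). In this case the similarity measure is (13 - 3 - 1 - 1) / (13 - 3), or 0.800.

Presumably you intend to compare the user input against a set of reference sentences, in which case you will need to run the user input against every reference sentence through the patienceDiffPlus algorithm and pick the highest similarity.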
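A minimal sketch of that selection loop follows. The name `bestMatch` and the `scoreFn` parameter are hypothetical; `scoreFn` stands in for whatever scoring you settle on, for example a wrapper that calls patienceDiffPlus and applies the similarity formula given earlier in this answer.

```javascript
// Hypothetical helper: score the user input against every reference sentence
// and keep the highest-scoring one.
// `scoreFn(inputWords, referenceWords)` is assumed to return a number in 0..1.
function bestMatch(userInput, references, scoreFn) {
  const inputWords = userInput.split(" ");
  let best = { reference: null, score: -Infinity };
  for (const reference of references) {
    const score = scoreFn(inputWords, reference.split(" "));
    if (score > best.score) {
      best = { reference, score };
    }
  }
  return best;
}

// Stand-in scorer used only for this demo: the fraction of input words
// that also occur in the reference.
const overlap = (input, ref) =>
  input.filter((word) => ref.includes(word)).length / input.length;

const result = bestMatch(
  "I was sent to earth",
  ["hello world", "I was sent to earth to protect you"],
  overlap
);
console.log(result.reference); // "I was sent to earth to protect you"
```

Passing the scorer in as a parameter keeps the loop independent of the diff library, which makes it easy to try different similarity measures.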

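That said, you will need a large sample of expected user inputs, checked against the reference sentences, to tune the similarity measure that works best for your application. You may also find that you have to strip punctuation, lowercase everything, remove common prepositions, and so on, to boil the user input down to the basics and help the matching process...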
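One possible pre-processing step is sketched below. The function name `normalizeWords` and the stop-word list are illustrative assumptions, not part of PatienceDiff, and you would tune the list to your own data.

```javascript
// Hypothetical pre-processing step; the stop-word list is an illustrative assumption.
const STOP_WORDS = new Set(["to", "the", "of", "in", "on", "at"]);

function normalizeWords(sentence, stopWords = STOP_WORDS) {
  return sentence
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // drop punctuation; keep letters, digits, whitespace
    .split(/\s+/)
    .filter((word) => word.length > 0 && !stopWords.has(word));
}

console.log(normalizeWords("I was sent to the Earth, to protect you!"));
// ["i", "was", "sent", "earth", "protect", "you"]
```

The resulting word arrays can be passed to the diff in place of `str.split(" ")`. Note that dropping words like "to" and "the" changes the counts the diff reports, so the similarity scores will differ from the examples above.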

Hope this helps.