Question

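This is code for checking the similarity percentage of two strings (str1 and str2). The code works fine and is fully accurate. It logs a number between 0 and 1 based on the similarity between the two strings (it checks similarity character by character).

So if we have the following strings: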

  var str1 = "I was sent to earth to protect you"; // user input
  var str2 = "I was sent to earth to protect you"; // reference

the similarity result will be 1.

Now, what if we want to compare a small part of a sentence against the reference string?

So if we have these:

  var str1 = "I was sent to earth"; // user input
  var str2 = "I was sent to earth to protect you"; // reference

or these:

  var str1 = "I was sent to earth"; // user input
  var str2 = "to protect you I was sent to earth"; // reference

the expected similarity result should be 1.

Here is my code:

function checkSimilarity(){
  var str1 = "I was sent to earth";
  var str2 = "I was sent to earth to protect you";
  console.log(similarity(str1, str2));
}

function similarity(s1, s2) {
  // Make `longer` the longer of the two strings.
  var longer = s1;
  var shorter = s2;
  if (s1.length < s2.length) {
    longer = s2;
    shorter = s1;
  }
  var longerLength = longer.length;
  if (longerLength === 0) {
    return 1.0; // two empty strings are identical
  }
  // Normalize the edit distance by the longer length to get a score in [0, 1].
  return (longerLength - editDistance(longer, shorter)) / longerLength;
}

// Case-insensitive Levenshtein distance using a single rolling row of costs.
function editDistance(s1, s2) {
  s1 = s1.toLowerCase();
  s2 = s2.toLowerCase();

  var costs = [];
  for (var i = 0; i <= s1.length; i++) {
    var lastValue = i;
    for (var j = 0; j <= s2.length; j++) {
      if (i === 0) {
        costs[j] = j;
      } else if (j > 0) {
        var newValue = costs[j - 1];
        if (s1.charAt(i - 1) !== s2.charAt(j - 1)) {
          newValue = Math.min(newValue, lastValue, costs[j]) + 1;
        }
        costs[j - 1] = lastValue;
        lastValue = newValue;
      }
    }
    if (i > 0) {
      costs[s2.length] = lastValue;
    }
  }
  return costs[s2.length];
}
checkSimilarity();

Thanks.

Answer 1

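I suggest using the patienceDiffPlus algorithm (see https://github.com/jonTrent/PatienceDiff), which can be used to compare two similar arrays of strings. Typically this algorithm is used to find the changes in an updated computer program, but in your case it can be used to compare the words in a sentence. Specifically, the algorithm searches for the longest common subsequence (LCS) between the word arrays and reports the number of inserted, deleted, and possibly moved words, from which the number of similar words in the common sequence can be calculated.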
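To make the idea of a longest common subsequence over words concrete, here is a minimal sketch. Note that it is a plain dynamic-programming LCS, not the patience diff algorithm from the linked repository, and the name `lcsLength` is only illustrative.

```javascript
// Plain dynamic-programming LCS over two word arrays.
// This is NOT patienceDiffPlus; it only illustrates what "common sequence" means.
function lcsLength(a, b) {
  // table[i][j] holds the LCS length of a[0..i) and b[0..j).
  const table = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      table[i][j] =
        a[i - 1] === b[j - 1]
          ? table[i - 1][j - 1] + 1
          : Math.max(table[i - 1][j], table[i][j - 1]);
    }
  }
  return table[a.length][b.length];
}

const userWords = "I was sent to earth".split(" ");
const referenceWords = "I was sent to earth to protect you".split(" ");
console.log(lcsLength(userWords, referenceWords)); // 5
```

All five words of the user input appear in the reference in the same order, so the common sequence has length 5.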

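In calculating the degree of similarity, it appears from your examples that extra words in the reference should not count against the score. However, without knowing the full range of reference sentences that will be compared with the user input, I suggest a calculation along the following lines: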

similarity = (Result.lines.length - Result.lineCountMoved - Result.lineCountDeleted - Result.lineCountInserted) / (Result.lines.length - Result.lineCountMoved)

Using the first example...
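The formula above could be wrapped in a small helper. This is only a sketch: `similarityFromDiff` is a hypothetical name, it assumes the result object has the four fields shown in this answer's console output, and returning 1 for two empty inputs is my assumption (it mirrors the question's own `similarity` function).

```javascript
// Computes the similarity measure from a patienceDiffPlus-style result object.
// Assumed shape: { lines, lineCountDeleted, lineCountInserted, lineCountMoved }.
function similarityFromDiff(result) {
  const total = result.lines.length - result.lineCountMoved;
  if (total === 0) {
    return 1; // assumption: nothing to compare counts as identical
  }
  const matched = total - result.lineCountDeleted - result.lineCountInserted;
  return matched / total;
}

// Hand-built object for illustration: 8 lines, 3 of them insertions.
const sample = {
  lines: new Array(8),
  lineCountDeleted: 0,
  lineCountInserted: 3,
  lineCountMoved: 0,
};
console.log(similarityFromDiff(sample)); // 0.625
```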

var str1 = "I was sent to earth"; // user input
var str2 = "I was sent to earth to protect you"; // reference
var compare12 = patienceDiffPlus(str1.split(" "), str2.split(" "));
console.log(compare12);
// {lines: Array(8), lineCountDeleted: 0, lineCountInserted: 3, lineCountMoved: 0}

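...indicates that str2 has 3 additional words inserted, and that 5 words are equal and in the same order. The similarity is (8-0-3)/(8-0), or 0.625. Now, using the second example...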

var str3 = "I was sent to earth"; // user input
var str4 = "to protect you I was sent to earth"; // reference
var compare34 = patienceDiffPlus(str3.split(" "), str4.split(" "));
console.log(compare34)
// {lines: Array(8), lineCountDeleted: 0, lineCountInserted: 3, lineCountMoved: 0}

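...again indicates that str4 has 3 additional words inserted, and that 5 words are equal and in the same order. As before, the similarity is 0.625. Now for a more complex example...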

var str5 = "I was sent to the earth to protect you";  // user input
var str6 = "to protect you I was sent to planet earth"; // reference
var compare56 = patienceDiffPlus(str5.split(" "), str6.split(" "));
console.log(compare56);
// {lines: Array(13), lineCountDeleted: 1, lineCountInserted: 1, lineCountMoved: 3}
//   lineCountDeleted: 1
//   lineCountInserted: 1
//   lineCountMoved: 3
//   lines: Array(13)
//     0: {line: "to", aIndex: 10, bIndex: 0, moved: true}
//     1: {line: "protect", aIndex: 11, bIndex: 1, moved: true}
//     2: {line: "you", aIndex: 12, bIndex: 2, moved: true}
//     3: {line: "I", aIndex: 0, bIndex: 3}
//     4: {line: "was", aIndex: 1, bIndex: 4}
//     5: {line: "sent", aIndex: 2, bIndex: 5}
//     6: {line: "to", aIndex: 3, bIndex: 6}
//     7: {line: "the", aIndex: 4, bIndex: -1}
//     8: {line: "planet", aIndex: -1, bIndex: 7}
//     9: {line: "earth", aIndex: 5, bIndex: 8}
//     10: {line: "to", aIndex: 6, bIndex: -1, moved: true}
//     11: {line: "protect", aIndex: 7, bIndex: -1, moved: true}
//     12: {line: "you", aIndex: 8, bIndex: -1, moved: true}
//     length: 13

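...indicates that str5 has one word deleted/missing relative to str6 ("the"), that str6 has one word inserted ("planet"), and that 3 words were possibly moved ("to", "protect", and "you"). In this case the similarity measure is (13-3-1-1)/(13-3), or 0.800.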

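Presumably you intend to compare a series of reference sentences against the user input, in which case you will need to run the user input against every reference sentence through the patienceDiffPlus algorithm and pick the highest similarity.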
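Picking the best reference might look like the sketch below. `bestMatch` is a hypothetical helper that accepts any scoring function; `wordOverlap` is only a stand-in scorer so the example runs on its own. In practice the scorer would wrap patienceDiffPlus and the similarity formula given earlier.

```javascript
// Scores the user input against every reference and keeps the highest score.
function bestMatch(userInput, references, scoreFn) {
  let best = { reference: null, score: -Infinity };
  for (const reference of references) {
    const score = scoreFn(userInput, reference);
    if (score > best.score) {
      best = { reference, score };
    }
  }
  return best;
}

// Stand-in scorer for illustration only (not the patience diff measure):
// the fraction of user-input words that also occur in the reference.
function wordOverlap(input, reference) {
  const referenceWords = new Set(reference.toLowerCase().split(" "));
  const inputWords = input.toLowerCase().split(" ");
  const shared = inputWords.filter((word) => referenceWords.has(word));
  return shared.length / inputWords.length;
}

const references = ["to protect you", "I was sent to earth to protect you"];
console.log(bestMatch("I was sent to earth", references, wordOverlap));
// { reference: 'I was sent to earth to protect you', score: 1 }
```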

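That said, you will need to test a large sample of expected user inputs against the reference sentences to tune the similarity measure that works best for your application. In addition, you may find that you have to strip punctuation, lowercase everything, remove common prepositions, and so on, to boil the user input down to its essentials and help the matching process...

Hope this helps.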
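A minimal sketch of that kind of preprocessing follows. The function name `normalizeWords` and the preposition list are hypothetical; the list in particular is an assumption that would need tuning for a real application.

```javascript
// Hypothetical list of common prepositions to drop; adjust for your data.
const STOP_WORDS = new Set(["to", "of", "in", "on", "at", "for", "with", "by", "from"]);

// Lowercases a sentence, strips punctuation, and removes the listed prepositions.
function normalizeWords(sentence) {
  return sentence
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // keep only letters, digits, and whitespace
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

console.log(normalizeWords("I was sent to Earth, to protect you!"));
// ["i", "was", "sent", "earth", "protect", "you"]
```

The resulting word arrays can then be passed to the comparison in place of the raw `split(" ")` output.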
