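This is code for checking the similarity percentage of two strings (str1 and str2). The code works fine and is accurate. It logs a number between 0 and 1 based on the similarity between the two strings (it compares them character by character).
So if we have the following strings: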
var str1 = "I was sent to earth to protect you"; // user input
var str2 = "I was sent to earth to protect you"; // reference
The similarity result will be 1.
Now, what if we want to compare a small part of a sentence against the reference string?
So if we have these:
var str1 = "I was sent to earth"; // user input
var str2 = "I was sent to earth to protect you"; // reference
Or these:
var str1 = "I was sent to earth"; // user input
var str2 = "to protect you I was sent to earth"; // reference
The expected similarity result should be 1.
Here is my code:
function checkSimilarity() {
  var str1 = "I was sent to earth";
  var str2 = "I was sent to earth to protect you";
  console.log(similarity(str1, str2));
}

// Returns a value between 0 and 1; 1 means the strings are equal (ignoring case).
function similarity(s1, s2) {
  var longer = s1;
  var shorter = s2;
  if (s1.length < s2.length) {
    longer = s2;
    shorter = s1;
  }
  var longerLength = longer.length;
  if (longerLength === 0) {
    return 1.0; // both strings are empty
  }
  return (longerLength - editDistance(longer, shorter)) / longerLength;
}

// Character-level edit distance, computed with a single rolling row of costs.
function editDistance(s1, s2) {
  s1 = s1.toLowerCase();
  s2 = s2.toLowerCase();
  var costs = [];
  for (var i = 0; i <= s1.length; i++) {
    var lastValue = i;
    for (var j = 0; j <= s2.length; j++) {
      if (i === 0) {
        costs[j] = j;
      } else if (j > 0) {
        var newValue = costs[j - 1];
        if (s1.charAt(i - 1) !== s2.charAt(j - 1)) {
          newValue = Math.min(newValue, lastValue, costs[j]) + 1;
        }
        costs[j - 1] = lastValue;
        lastValue = newValue;
      }
    }
    if (i > 0) {
      costs[s2.length] = lastValue;
    }
  }
  return costs[s2.length];
}

checkSimilarity();
Thanks.
Answer 0 (score: 0)
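I suggest using the patienceDiffPlus algorithm (see https://github.com/jonTrent/PatienceDiff), which can be used to compare two similar arrays of strings. This algorithm is typically used to find the changes in an updated computer program, but in your case it can be used to compare the words in sentences. Specifically, the algorithm searches for the longest common subsequence (LCS) between the word arrays and reports the number of inserted, deleted, and likely moved words, from which the number of matching words in the common sequence can be calculated.
In calculating the degree of equality, based on your examples, it appears that extra words in the reference should not count against the degree of similarity. However, not knowing the full range of reference sentences that will be compared with user input, I suggest a calculation along these lines:
Similarity = (Result.lines.length - Result.lineCountMoved - Result.lineCountDeleted - Result.lineCountInserted) / (Result.lines.length - Result.lineCountMoved)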
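The formula above can be sketched as a small helper. This is only a minimal sketch, assuming the result object has the fields shown in the console output in this answer (`lines`, `lineCountDeleted`, `lineCountInserted`, `lineCountMoved`). The name `similarityFromDiff` and the choice to return 1 for two empty inputs are assumptions for illustration, not part of PatienceDiff.

```javascript
// Hypothetical helper (not part of PatienceDiff): turns a diff result into a 0..1 score.
// Assumes `result` has the fields lines, lineCountDeleted, lineCountInserted, lineCountMoved.
function similarityFromDiff(result) {
  // Denominator of the formula: all reported entries minus the moved ones.
  const total = result.lines.length - result.lineCountMoved;
  if (total === 0) {
    return 1; // assumption: two empty inputs are treated as identical
  }
  // Numerator: what remains after also removing deleted and inserted words.
  const matching = total - result.lineCountDeleted - result.lineCountInserted;
  return matching / total;
}

// Result shape copied from the first example in this answer.
const sample = { lines: new Array(8), lineCountDeleted: 0, lineCountInserted: 3, lineCountMoved: 0 };
console.log(similarityFromDiff(sample)); // 0.625
```

If extra words in the reference really should not lower the score, one option would be to leave `lineCountInserted` out of both the numerator and the denominator, but that is a design choice to validate against real inputs.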
Using the first example...
var str1 = "I was sent to earth"; // user input
var str2 = "I was sent to earth to protect you"; // reference
var compare12 = patienceDiffPlus(str1.split(" "), str2.split(" "));
console.log(compare12);
// {lines: Array(8), lineCountDeleted: 0, lineCountInserted: 3, lineCountMoved: 0}
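...indicating that str2 has 3 additional words inserted, and that 5 words are equal and in the same order. The similarity is (8-0-3)/(8-0), or 0.625. Now, using the second example...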
var str3 = "I was sent to earth"; // user input
var str4 = "to protect you I was sent to earth"; // reference
var compare34 = patienceDiffPlus(str3.split(" "), str4.split(" "));
console.log(compare34);
// {lines: Array(8), lineCountDeleted: 0, lineCountInserted: 3, lineCountMoved: 0}
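Again... indicating that str4 has 3 additional words inserted, and that 5 words are equal and in the same order. As before, the similarity is 0.625. Now for a more complex example...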
var str5 = "I was sent to the earth to protect you"; // user input
var str6 = "to protect you I was sent to planet earth"; // reference
var compare56 = patienceDiffPlus(str5.split(" "), str6.split(" "));
console.log(compare56);
// {lines: Array(13), lineCountDeleted: 1, lineCountInserted: 1, lineCountMoved: 3}
// lineCountDeleted: 1
// lineCountInserted: 1
// lineCountMoved: 3
// lines: Array(13)
// 0: {line: "to", aIndex: 10, bIndex: 0, moved: true}
// 1: {line: "protect", aIndex: 11, bIndex: 1, moved: true}
// 2: {line: "you", aIndex: 12, bIndex: 2, moved: true}
// 3: {line: "I", aIndex: 0, bIndex: 3}
// 4: {line: "was", aIndex: 1, bIndex: 4}
// 5: {line: "sent", aIndex: 2, bIndex: 5}
// 6: {line: "to", aIndex: 3, bIndex: 6}
// 7: {line: "the", aIndex: 4, bIndex: -1}
// 8: {line: "planet", aIndex: -1, bIndex: 7}
// 9: {line: "earth", aIndex: 5, bIndex: 8}
// 10: {line: "to", aIndex: 6, bIndex: -1, moved: true}
// 11: {line: "protect", aIndex: 7, bIndex: -1, moved: true}
// 12: {line: "you", aIndex: 8, bIndex: -1, moved: true}
// length: 13
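...indicating that one word ("the") in str5 is deleted/missing relative to str6, that str6 has one word inserted ("planet"), and that 3 words ("to", "protect", and "you") were likely moved. In this case, the similarity measure is (13-3-1-1)/(13-3), or 0.800.
Presumably you intend to compare user input against a series of reference sentences, in which case you will need to run the user input through the patienceDiffPlus algorithm against every reference sentence and pick the one with the highest similarity.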
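Picking the best reference can be sketched roughly as follows. The names `bestMatch` and `wordOverlap` are hypothetical, and `wordOverlap` is only a stand-in scorer so that the snippet runs on its own; it is not patienceDiffPlus. In practice you would pass a scoring function built on patienceDiffPlus and the similarity formula from this answer.

```javascript
// Hypothetical helper: scores `input` against every reference and keeps the best one.
// `score` is any function (input, reference) => number between 0 and 1.
function bestMatch(input, references, score) {
  let best = { reference: null, similarity: -Infinity };
  for (const reference of references) {
    const similarity = score(input, reference);
    if (similarity > best.similarity) {
      best = { reference, similarity };
    }
  }
  return best;
}

// Stand-in scorer for demonstration only (NOT patienceDiffPlus):
// the fraction of input words that also occur somewhere in the reference.
function wordOverlap(input, reference) {
  const refWords = new Set(reference.toLowerCase().split(/\s+/));
  const words = input.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) {
    return 0;
  }
  return words.filter((w) => refWords.has(w)).length / words.length;
}

const references = [
  "to protect you I was sent to earth",
  "the weather is nice today",
];
console.log(bestMatch("I was sent to earth", references, wordOverlap));
```

Passing the scorer as a parameter keeps the selection loop independent of whichever similarity measure you end up tuning.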
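That said, you will need to test a large sample of expected user inputs against the reference sentences to tune the similarity measure that best suits your application. Additionally, you may find it necessary to strip punctuation, lowercase everything, remove common prepositions, and so on, to boil the user input down to its essentials and help the matching process...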
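The cleanup steps mentioned above might look something like this. It is a sketch under stated assumptions: `normalizeWords` and the `STOP_WORDS` list are hypothetical and would need tuning for a real application. The resulting word array could be passed to the diff in place of `str.split(" ")`.

```javascript
// Hypothetical stop-word list (articles and a few prepositions); tune it for your application.
const STOP_WORDS = new Set(["a", "an", "the", "of", "in", "on", "at", "to", "for"]);

// Lowercases the text, replaces punctuation with spaces, splits on whitespace,
// and drops empty tokens and stop words.
function normalizeWords(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // anything that is not a letter, digit, or whitespace
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

console.log(normalizeWords("I was sent to the Earth, to protect you!"));
// ["i", "was", "sent", "earth", "protect", "you"]
```

Note that replacing punctuation with spaces splits contractions such as "don't" into two tokens, which may or may not be what you want.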
Hope this helps.