Question

I am trying to compute the similarity between two sentences in a shell script.

I have two sentences that contain repeated words, for example this input data in the file my_text.txt:

Shell Script.
Linux Shell Script.

Intersection of the two sentences: Shell + Script
Size of the union of the two sentences: 3

Correct output for the sentence similarity:

 0.30000000000000000000

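Similarity is defined as the number of words in the intersection of the two sentences divided by the number of words in their union.

Problem: I have searched a lot for a shell script that does this, but I have not found a solution.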

Answer 1

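The following script should do the job. It also ignores repeated words within each sentence, filler words, and non-alphabetic characters, as you described in the comments.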

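# Requires bash (process substitution and here-strings).
# Emit one lowercase word per line, tagged with its line number, and drop
# the filler words; sort -u removes repeats within a sentence before the
# line-number tag is cut off.
words=$(
  < my_text.txt tr 'A-Z' 'a-z' |
  grep -Eon '\b[a-z]*\b' |
  grep -Fwvf <(printf %s\\n is a to be by the and for) |
  sort -u | cut -d: -f2 | sort
)
# Distinct words form the union; words listed twice occur in both sentences.
union=$(uniq <<< "$words" | wc -l)
intersection=$(uniq -d <<< "$words" | wc -l)
echo "similarity is $(bc -l <<< "$intersection/$union")"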

For the example input, the output is .30000000000000000000 (= 0.3).
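As an alternative sketch, not part of the script above, the same intersection-over-union ratio can be computed with `comm` on two sorted word lists. The function names `words_of` and `jaccard` are hypothetical, filler-word filtering is left out to keep the example short, and it assumes `mktemp` is available.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct lowercase ASCII words of a
# sentence, one per line, sorted in the C locale.
words_of() {
  printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | grep . | LC_ALL=C sort -u
}

# Hypothetical helper: size of the word intersection divided by the size of
# the word union, printed with six decimals (0.000000 if both are empty).
jaccard() {
  a=$(mktemp) b=$(mktemp)
  words_of "$1" > "$a"
  words_of "$2" > "$b"
  common=$(LC_ALL=C comm -12 "$a" "$b" | wc -l)   # lines present in both lists
  total=$(LC_ALL=C sort -u "$a" "$b" | wc -l)     # distinct lines overall
  rm -f "$a" "$b"
  awk -v c="$common" -v t="$total" 'BEGIN { printf "%.6f\n", (t ? c / t : 0) }'
}
```

It could be called as `jaccard "Shell Script." "Linux Shell Script."`. Using `awk` for the division avoids `bc` printing a result without a leading zero.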

Answer 2

Is this what you are trying to do (using GNU awk for FPAT and arrays of arrays)?

$ cat tst.awk
BEGIN {
    split("is a to be by the and for",tmp)
    for (i in tmp) {
        stopwords[tmp[i]]
    }
    FPAT="[[:alnum:]_]+"
}
{
    for (i=1; i<=NF; i++) {
        word = tolower($i)
        if ( !(word in stopwords) ) {
            words[NR][word]
        }
    }
}
END {
    for (word in words[1]) {
        if (word in words[2]) {
            numCommon++
        }
    }
    totWords = length(words[1]) + length(words[2]) - numCommon
    print (totWords ? numCommon / totWords : 0)
}

$ awk -f tst.awk file
0.666667
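If GNU awk is not available, the same idea can be sketched in POSIX awk by keying a single array on the line number and the word instead of using arrays of arrays. This is an illustration under assumptions rather than a drop-in replacement: the wrapper name `similarity` is hypothetical, only ASCII letters, digits, and underscores count as word characters, and only the first two lines of the file are compared.

```shell
#!/bin/sh
# Hypothetical wrapper: similarity of the first two lines of the file "$1".
similarity() {
  awk '
    BEGIN {
      n = split("is a to be by the and for", tmp, " ")
      for (i = 1; i <= n; i++) stop[tmp[i]] = 1
    }
    NR <= 2 {
      line = tolower($0)
      gsub(/[^a-z0-9_]+/, " ", line)   # non-word characters become separators
      n = split(line, w, " ")
      for (i = 1; i <= n; i++)
        if (!(w[i] in stop)) seen[NR, w[i]] = 1
    }
    END {
      for (key in seen) {
        split(key, part, SUBSEP)       # part[1] = line number, part[2] = word
        if (part[1] == 1) { count1++; if ((2, part[2]) in seen) common++ }
        else count2++
      }
      total = count1 + count2 - common
      printf "%.6f\n", (total ? common / total : 0)
    }
  ' "$1"
}
```

It could be called as `similarity my_text.txt`.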
