Java API for semantic similarity/relatedness between two words

Time: 2015-02-24 18:11:56

Tags: java semantics similarity wordnet ws4j

I need to find the semantic similarity/relatedness between two input words. The following words are similar or related in the real world:

- genuineness, genuine, genuinely, valid, reality, fact, really   
- painter, painting, paint

Below is the code I took from here:

    ILexicalDatabase db = new NictWordNet();
    RelatednessCalculator lin = new Lin(db);
    RelatednessCalculator wup = new WuPalmer(db);
    RelatednessCalculator path = new Path(db);

    String w1 = "truth";
    String w2 = "genuine";
    System.out.println(lin.calcRelatednessOfWords(w1, w2));
    System.out.println(wup.calcRelatednessOfWords(w1, w2));
    System.out.println(path.calcRelatednessOfWords(w1, w2));

I am using the WS4J API (ws4j1.0.1.jar) with Java 1.7 in Eclipse 3.4. The results I get make no sense, or perhaps my understanding is wrong.


If my approach is wrong, please tell me which other API I should use to work out the similarity between words.

1 Answer:

Answer 0: (score: 1)

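It looks like the words cannot be found in the data set you have configured, so it simply returns a score of 0.0. For example, the following nonsense words result in a score of 0.0 as well: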

ILexicalDatabase db = new NictWordNet();
RelatednessCalculator lin = new Lin(db);
RelatednessCalculator wup = new WuPalmer(db);
RelatednessCalculator path = new Path(db);

String w1 = "iamatotallycompletelyfakewordwithagermanwordinsidevergnügen";
String w2 = "iamevenmorefakeandstrangerossiskajafoderatsija";
System.out.println(lin.calcRelatednessOfWords(w1, w2));
System.out.println(wup.calcRelatednessOfWords(w1, w2));
System.out.println(path.calcRelatednessOfWords(w1, w2));
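To illustrate why a word that is missing from the data set ends up with 0.0, here is a minimal, self-contained sketch of a path-based measure over a tiny hand-made hypernym tree. It is not WS4J's implementation: the class `ToyTaxonomy`, its methods, and the sample words are hypothetical and exist only for this illustration. The score is 1 divided by the number of nodes on the shortest path between the two words, and a word that is not in the tree has no path at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy is-a tree standing in for a lexical database (hypothetical, not WS4J code). */
class ToyTaxonomy {
    // Maps each word to its hypernym (parent); the root has no entry.
    private final Map<String, String> parent = new HashMap<>();
    private final String root;

    ToyTaxonomy(String root) { this.root = root; }

    void add(String child, String hypernym) { parent.put(child, hypernym); }

    boolean contains(String word) {
        return word.equals(root) || parent.containsKey(word);
    }

    /** Returns the chain word, hypernym, ..., root. */
    private List<String> pathToRoot(String word) {
        List<String> path = new ArrayList<>();
        for (String w = word; w != null; w = parent.get(w)) {
            path.add(w);
        }
        return path;
    }

    /** 1 / (nodes on the shortest path); 0.0 when either word is unknown. */
    double pathSimilarity(String a, String b) {
        if (!contains(a) || !contains(b)) {
            return 0.0; // nothing to compare, so the score is just zero
        }
        List<String> pa = pathToRoot(a);
        List<String> pb = pathToRoot(b);
        for (int i = 0; i < pa.size(); i++) {
            int j = pb.indexOf(pa.get(i)); // first shared ancestor
            if (j >= 0) {
                return 1.0 / (i + j + 1);
            }
        }
        return 0.0;
    }

    /** Small sample tree with made-up structure. */
    static ToyTaxonomy sample() {
        ToyTaxonomy t = new ToyTaxonomy("entity");
        t.add("person", "entity");
        t.add("artist", "person");
        t.add("painter", "artist");
        t.add("sculptor", "artist");
        t.add("artifact", "entity");
        t.add("painting", "artifact");
        return t;
    }

    public static void main(String[] args) {
        ToyTaxonomy t = sample();
        System.out.println(t.pathSimilarity("painter", "sculptor"));
        System.out.println(t.pathSimilarity("painter", "painting"));
        System.out.println(t.pathSimilarity("painter", "iamatotallycompletelyfakeword"));
    }
}
```

In this toy tree, "painter" and "sculptor" share the parent "artist" (3 nodes on the path, so 1/3), "painter" and "painting" only meet at the root (6 nodes, so 1/6), and the fake word yields 0.0 because it is not in the tree.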

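Unfortunately, I can't tell what your configuration looks like, and the link you provided doesn't seem to work (at least not anymore). However, the JAR of ws4j 1.0.1 at Google Code contains its own information content file (named ic-semcor.dat), which is configured in the file similarity.conf: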

# ----------------------------------------------------------------------
# The following option is supported by :
#               res, lin, jcn

infocontent = ic-semcor.dat

            # Specifies the name of an information content file under 
            # data/. The value of this option must be the name of a 
            # file, or a relative or absolute path name. The default 
            # value of this option ic-semcor.dat.

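With this setup, I get the same results for the words you listed in your table. You should therefore read up more on how the individual RelatednessCalculator implementations for the different metrics are configured.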
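As a rough illustration of why the information content file matters for res, lin and jcn, here is a small sketch of the Lin measure, which is commonly defined as 2 * IC(lcs) / (IC(a) + IC(b)), where IC(c) = -log(count(c) / total) and lcs is the closest shared ancestor of the two concepts. This is not WS4J code: the class `ToyLin`, the counts, and the choice to return 0.0 when a count is missing are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy Lin measure over made-up corpus counts (hypothetical, not WS4J code). */
class ToyLin {
    private final Map<String, Double> counts = new HashMap<>();
    private final double total;

    ToyLin(double total) { this.total = total; }

    void count(String concept, double n) { counts.put(concept, n); }

    /** IC(c) = log(total / count(c)), which equals -log(count(c) / total). */
    double ic(String concept) {
        return Math.log(total / counts.get(concept));
    }

    /** Lin score for a, b and their closest shared ancestor lcs. */
    double lin(String a, String b, String lcs) {
        // Assumption of this sketch: a concept without a count scores 0.0.
        if (!counts.containsKey(a) || !counts.containsKey(b) || !counts.containsKey(lcs)) {
            return 0.0;
        }
        double denom = ic(a) + ic(b);
        return denom == 0.0 ? 0.0 : 2.0 * ic(lcs) / denom;
    }

    /** Made-up counts, standing in for an information content file. */
    static ToyLin sample() {
        ToyLin l = new ToyLin(1000.0);
        l.count("entity", 1000.0);
        l.count("artist", 50.0);
        l.count("painter", 10.0);
        l.count("sculptor", 5.0);
        return l;
    }

    public static void main(String[] args) {
        ToyLin l = sample();
        System.out.println(l.lin("painter", "sculptor", "artist"));
        System.out.println(l.lin("painter", "sculptor", "entity"));
        System.out.println(l.lin("painter", "notcounted", "entity"));
    }
}
```

With these made-up counts, a specific shared ancestor such as "artist" gives a score strictly between 0 and 1, while the root "entity" carries no information (IC of 0) and gives 0.0. A concept with no count also gives 0.0 here, which is one way a missing or mismatched information content file could flatten the scores.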