I need to find the semantic similarity/relatedness between two input words. The following words are similar or related in the real world:
- genuineness, genuine, genuinely, valid, reality, fact, really
- painter, painting, paint
Below is the code I got from here:
ILexicalDatabase db = new NictWordNet();
RelatednessCalculator lin = new Lin(db);
RelatednessCalculator wup = new WuPalmer(db);
RelatednessCalculator path = new Path(db);
String w1 = "truth";
String w2 = "genuine";
System.out.println(lin.calcRelatednessOfWords(w1, w2));
System.out.println(wup.calcRelatednessOfWords(w1, w2));
System.out.println(path.calcRelatednessOfWords(w1, w2));
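I am using the WS4J API (ws4j1.0.1.jar) with Java 1.7 in Eclipse 3.4. The results I get do not make any sense, or perhaps my understanding is wrong.
If my approach is wrong, please tell me which other API I should use to find the similarity between words.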
Answer 0 (score: 1)
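It looks like the words cannot be found in the dataset you have configured, so the calculators simply return a score of 0.0. For example, the following nonsense words also produce a score of 0.0: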
ILexicalDatabase db = new NictWordNet();
RelatednessCalculator lin = new Lin(db);
RelatednessCalculator wup = new WuPalmer(db);
RelatednessCalculator path = new Path(db);
String w1 = "iamatotallycompletelyfakewordwithagermanwordinsidevergnügen";
String w2 = "iamevenmorefakeandstrangerossiskajafoderatsija";
System.out.println(lin.calcRelatednessOfWords(w1, w2));
System.out.println(wup.calcRelatednessOfWords(w1, w2));
System.out.println(path.calcRelatednessOfWords(w1, w2));
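To illustrate why a word that is missing from the data ends up with 0.0, here is a minimal sketch of a path-style score over a tiny hand-built is-a taxonomy. It does not use WS4J or WordNet data: the class ToyTaxonomy, its methods, and the three is-a links are made up for illustration, and the real calculators may handle lookups differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy is-a taxonomy (hypothetical, NOT WS4J and NOT WordNet data).
class ToyTaxonomy {
    // Maps each child word to its parent (hypernym); the root has no parent.
    private final Map<String, String> parent = new HashMap<>();

    void addIsA(String child, String par) {
        parent.put(child, par);
    }

    // A word is "known" if it appears anywhere in the taxonomy.
    boolean contains(String word) {
        return parent.containsKey(word) || parent.containsValue(word);
    }

    // Returns the chain from the word up to the root, the word itself first.
    private List<String> chainToRoot(String word) {
        List<String> chain = new ArrayList<>();
        for (String cur = word; cur != null; cur = parent.get(cur)) {
            chain.add(cur);
        }
        return chain;
    }

    // 1 / (number of nodes on the shortest path between the two words),
    // or 0.0 when either word is unknown or no common ancestor exists.
    double pathSimilarity(String w1, String w2) {
        if (!contains(w1) || !contains(w2)) {
            return 0.0; // unknown word: nothing to measure
        }
        List<String> a = chainToRoot(w1);
        List<String> b = chainToRoot(w2);
        for (int i = 0; i < a.size(); i++) {
            int j = b.indexOf(a.get(i)); // lowest common ancestor, if any
            if (j >= 0) {
                return 1.0 / (i + j + 1);
            }
        }
        return 0.0;
    }

    // Made-up sample data for the demonstration.
    static ToyTaxonomy sample() {
        ToyTaxonomy t = new ToyTaxonomy();
        t.addIsA("painter", "artist");
        t.addIsA("sculptor", "artist");
        t.addIsA("artist", "person");
        return t;
    }

    public static void main(String[] args) {
        ToyTaxonomy t = sample();
        // Two known words that share the ancestor "artist".
        System.out.println(t.pathSimilarity("painter", "sculptor"));
        // One word is not in the taxonomy at all.
        System.out.println(t.pathSimilarity("painter", "iamatotallyfakeword"));
    }
}
```

In this sketch a score of 0.0 means either "unknown word" or "no connection", so a zero on its own does not say which of the two happened.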
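Unfortunately, I cannot tell what your configuration looks like, and the link you provided does not seem to work (at least not anymore). However, the ws4j 1.0.1 JAR on Google Code contains its own information content file (named ic-semcor.dat), which is configured in the file similarity.conf: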
# ----------------------------------------------------------------------
# The following option is supported by :
# res, lin, jcn
infocontent = ic-semcor.dat
# Specifies the name of an information content file under
# data/. The value of this option must be the name of a
# file, or a relative or absolute path name. The default
# value of this option ic-semcor.dat.
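With this setup, I get the same results for the words you listed in your table. So you should read up on how the individual RelatednessCalculator implementations for the different metrics are configured.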
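If you want to check which information content file a similarity.conf selects, one option is to read it as plain key = value pairs. This is a minimal sketch, assuming the file is compatible with java.util.Properties (the excerpt above is: '#' starts a comment and the setting is written as key = value). The class ConfPeek is hypothetical, the file content is inlined as a string, and this is not how WS4J itself loads its configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of WS4J.
class ConfPeek {
    // Parses key = value text; lines starting with '#' are treated as comments.
    static Properties parse(String confText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(confText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail in practice.
            throw new IllegalStateException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        // Inlined excerpt standing in for the real similarity.conf.
        String conf = "# The following option is supported by :\n"
                + "# res, lin, jcn\n"
                + "infocontent = ic-semcor.dat\n";
        Properties p = parse(conf);
        // Prints the configured file name, or a marker if the key is absent.
        System.out.println(p.getProperty("infocontent", "(not set)"));
    }
}
```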