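I'm trying to lemmatize words in many different languages, using the models provided by TreeTagger. The problem is that when a language uses non-ASCII (UTF-8) characters, e.g. Japanese or Bulgarian, the output is not encoded correctly.
For example: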
System.out.println(MyBulgarianLemmatizer.getInstance().getLemmatized("Рочестър е град в окръг Монро в щата Ню Йорк"));
prints: ???????? - ???? - ????? ????? - ???? ?? ????
Here is the MyLemmatizer class:
import java.io.File;
import java.io.IOException;
// TreeTaggerWrapper and TreeTaggerException come from tt4j
import org.annolab.tt4j.TreeTaggerException;
import org.annolab.tt4j.TreeTaggerWrapper;

public abstract class MyLemmatizer
{
    private TreeTaggerWrapper<String> treeTagger;
    private MyTokenHandler tokenHandler;

    protected MyLemmatizer(String model)
    {
        // Make sure the bundled tree-tagger binary can be executed
        File treeTaggerFile = new File("resources/treetagger/bin/tree-tagger");
        treeTaggerFile.setExecutable(true);
        tokenHandler = new MyTokenHandler();
        System.setProperty("treetagger.home", "resources/treetagger");
        treeTagger = new TreeTaggerWrapper<String>();
        treeTagger.setHandler(tokenHandler);
        try
        {
            getTreeTagger().setModel(model);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    protected TreeTaggerWrapper<String> getTreeTagger()
    {
        return treeTagger;
    }

    // Split the phrase on spaces and send the tokens to TreeTagger
    private void process(String phrase)
    {
        try
        {
            treeTagger.process(phrase.split(" "));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        catch (TreeTaggerException e)
        {
            e.printStackTrace();
        }
    }

    // Return the lemmas collected by the token handler, joined by spaces
    public String getLemmatized(String phrase)
    {
        process(phrase);
        String lemmatized = "";
        for (String word : tokenHandler.getLemmas())
            lemmatized += word + " ";
        tokenHandler.reset();
        return lemmatized.replaceAll(" $", "");
    }

    public void destroy()
    {
        treeTagger.destroy();
    }
}
And here is an example of a subclass (MyBulgarianLemmatizer):
public class MyBulgarianLemmatizer extends MyLemmatizer
{
    private static MyLemmatizer instance;
    private static final String MODEL = "bulgarian.par:iso8859-1";

    private MyBulgarianLemmatizer()
    {
        super(MODEL);
    }

    public static MyLemmatizer getInstance()
    {
        if (instance == null)
            instance = new MyBulgarianLemmatizer();
        return instance;
    }
}
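I also tried removing the :iso8859-1 suffix from the end of the model file string, but the output is still wrong (although it no longer shows question marks).
Thanks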
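As a sanity check outside TreeTagger, here is a minimal sketch in plain Java that produces the same kind of damage. The class name EncodingCheck and its two methods are made up for this test and are not part of tt4j. I'm only assuming that something similar happens between the wrapper and the tree-tagger process; I haven't confirmed it. Encoding Cyrillic text as ISO-8859-1 replaces every letter with a question mark, while decoding UTF-8 bytes as ISO-8859-1 gives garbled text without question marks.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck
{
    // Encode with ISO-8859-1, which has no Cyrillic letters, then decode again.
    // String.getBytes substitutes the charset's replacement byte ('?') for
    // every character it cannot map.
    static String roundTripLatin1(String text)
    {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode as UTF-8 but decode as ISO-8859-1: every byte becomes its own
    // character, so the text is garbled but contains no question marks.
    static String utf8ReadAsLatin1(String text)
    {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        // The Bulgarian word "град", written with Unicode escapes so the
        // result does not depend on the encoding of this source file
        String word = "\u0433\u0440\u0430\u0434";
        System.out.println(roundTripLatin1(word));           // prints ????
        System.out.println(utf8ReadAsLatin1(word).length()); // prints 8
    }
}
```

The first case looks like what I get with :iso8859-1 in the model string, and the second looks like what I get without it, but I'm not sure where in my setup the conversion actually takes place.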