TreeTagger in Java produces garbled output for special characters

Asked: 2014-06-08 10:27:02

Tags: java encoding utf-8 nlp lemmatization

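I am trying to lemmatize words in many different languages, using the models provided by TreeTagger. The problem is that when a language uses characters that require UTF-8 (for example Japanese or Bulgarian), the output is not encoded correctly.

For example: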

System.out.println(MyBulgarianLemmatizer.getInstance().getLemmatized("Рочестър е град в окръг Монро в щата Ню Йорк"));

This prints: ???????? - ???? - ????? ????? - ???? ?? ????
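
One way question marks of exactly this shape can arise is when text is encoded in a charset that has no Cyrillic letters, such as ISO-8859-1. The sketch below is only an illustration using the JDK, not TreeTagger itself. The class name `QuestionMarkDemo` and the helper `roundTrip` are hypothetical and do not appear in my project.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encodes the text in the given charset and decodes it again with the same charset.
    // String.getBytes(Charset) replaces characters the charset cannot represent,
    // and for ISO-8859-1 the replacement is '?'.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Рочестър" written with Unicode escapes so the source file encoding does not matter.
        String word = "\u0420\u043e\u0447\u0435\u0441\u0442\u044a\u0440";

        // ISO-8859-1 has no Cyrillic letters, so every character is lost.
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1)); // prints ????????

        // UTF-8 can represent every character, so the text survives unchanged.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word)); // prints true
    }
}
```

The number of question marks equals the number of characters in the word, which matches the pattern in the output above. This suggests, but does not prove, that the text passes through a Latin-1 encoding step somewhere.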

Here is the MyLemmatizer class:

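import java.io.File;
import java.io.IOException;

// TreeTaggerWrapper and TreeTaggerException come from the tt4j library.
import org.annolab.tt4j.TreeTaggerException;
import org.annolab.tt4j.TreeTaggerWrapper;

public abstract class MyLemmatizer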
{
    private TreeTaggerWrapper<String> treeTagger;
    private MyTokenHandler tokenHandler;

    protected MyLemmatizer(String model)
    {
        File treeTaggerFile = new File("resources/treetagger/bin/tree-tagger");
        treeTaggerFile.setExecutable(true);
        tokenHandler = new MyTokenHandler();
        System.setProperty("treetagger.home", "resources/treetagger");
        treeTagger = new TreeTaggerWrapper<String>();
        treeTagger.setHandler(tokenHandler);

        try
        {
            getTreeTagger().setModel(model);
        }

        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    protected TreeTaggerWrapper<String> getTreeTagger()
    {
        return treeTagger;
    }

    private void process(String phrase)
    {
        try
        {
            treeTagger.process(phrase.split(" "));
        }

        catch (IOException e)
        {
            e.printStackTrace();
        }

        catch (TreeTaggerException e)
        {
            e.printStackTrace();
        }
    }

    public String getLemmatized(String phrase)
    {
        process(phrase);
        String lemmatized = "";

        for (String word : tokenHandler.getLemmas())
            lemmatized += word + " ";

        tokenHandler.reset();

        return lemmatized.replaceAll(" $", "");
    }

    public void destroy()
    {
        treeTagger.destroy();
    }
}

Here is an example of a subclass (such as MyBulgarianLemmatizer):

public class MyBulgarianLemmatizer extends MyLemmatizer
{
    private static MyLemmatizer instance;
    private static final String MODEL = "bulgarian.par:iso8859-1";

    private MyBulgarianLemmatizer()
    {
        super(MODEL);
    }

    public static MyLemmatizer getInstance()
    {
        if (instance == null)
            instance = new MyBulgarianLemmatizer();

        return instance;
    }
}

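I also tried removing the :iso8859-1 suffix from the end of the model string, but the output is still wrong (although it no longer shows question marks).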
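
Output that is wrong but contains no question marks often looks like UTF-8 bytes being decoded with a different charset, or a console that does not print UTF-8. The sketch below shows both effects with the JDK only. The class `Utf8OutputDemo` and its methods `misdecode` and `printUtf8` are hypothetical names for illustration, and I have not confirmed that this is what happens inside the TreeTagger wrapper.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8OutputDemo {
    // Decodes the UTF-8 bytes of the text as ISO-8859-1.
    // Every byte maps to some character, so no '?' appears, but the text is garbled.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Writes one line to the target stream as UTF-8, independent of the platform default.
    public static void printUtf8(OutputStream target, String line) {
        try {
            PrintStream out = new PrintStream(target, true, "UTF-8");
            out.println(line);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "град" written with Unicode escapes so the source file encoding does not matter.
        String word = "\u0433\u0440\u0430\u0434";

        // Each Cyrillic letter takes two bytes in UTF-8, so the garbled string is twice as long.
        System.out.println(misdecode(word).length()); // prints 8

        // Printing through an explicit UTF-8 stream rules out the default console encoding.
        printUtf8(System.out, word);
    }
}
```

If the lemmas print correctly through `printUtf8` but not through `System.out.println`, the problem is probably in the console or default charset rather than in the tagger output.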

Thanks.

0 answers:

No answers yet.