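I am trying to restrict the Apache Tika LanguageIdentifier to a fixed set of languages. When I run the code below, it always detects the text as "de", i.e. German. I want to restrict the languages so that LanguageIdentifier performs better. After clearing the standard profiles, I add only the profiles I want to a map and initialize LanguageIdentifier with that map.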
    String[] languages = {"de", "en", "fr", "nl", "es"};
    Map<String, LanguageProfile> languageMaps = new HashMap<String, LanguageProfile>();
    LanguageIdentifier.clearProfiles();
    for (String language : languages) {
        LanguageProfile profile = new LanguageProfile();
        languageMaps.put(language, profile);
    }
    LanguageIdentifier.initProfiles(languageMaps);
    String docText = "Hello";
    LanguageIdentifier identifier = new LanguageIdentifier(docText);
    System.out.println(identifier.getLanguage());
When I run the following code
    LanguageIdentifier.getSupportedLanguages()
it returns the languages in my array, so I really don't know what is going wrong.
Answer 0: (score: 1)
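When you create a new LanguageProfile, you have to add all the ngram information again for the languages you define yourself. With your setup you only create empty containers, so the first profile is always picked: it comes first, and there is no other information to tell the languages apart.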
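To make the effect concrete, here is a minimal sketch in plain Java, with no Tika dependency. It assumes a simplified comparison (Euclidean distance between relative ngram frequencies) and a "first strictly smaller distance wins" rule; the class `EmptyProfileDemo`, its methods, and the sample trigrams are hypothetical stand-ins for illustration, not Tika's implementation. With empty profiles every language ends up at the same distance from the text, so whichever profile is iterated first wins.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class EmptyProfileDemo {

    // Simplified stand-in for a profile comparison (an assumption, not Tika's code):
    // Euclidean distance between relative ngram frequencies.
    static double distance(Map<String, Long> text, Map<String, Long> profile) {
        double textTotal = Math.max(total(text), 1L);
        double profileTotal = Math.max(total(profile), 1L);
        Set<String> ngrams = new HashSet<>(text.keySet());
        ngrams.addAll(profile.keySet());
        double sum = 0.0;
        for (String ngram : ngrams) {
            double a = text.getOrDefault(ngram, 0L) / textTotal;
            double b = profile.getOrDefault(ngram, 0L) / profileTotal;
            sum += (a - b) * (a - b);
        }
        return Math.sqrt(sum);
    }

    static long total(Map<String, Long> counts) {
        long sum = 0;
        for (long count : counts.values()) {
            sum += count;
        }
        return sum;
    }

    // Returns the first language whose distance is strictly smaller than all earlier ones.
    static String closest(Map<String, Long> text, Map<String, Map<String, Long>> profiles) {
        String best = "unknown";
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Long>> entry : profiles.entrySet()) {
            double d = distance(text, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative trigram counts for the input text.
        Map<String, Long> text = new HashMap<>();
        text.put("hel", 1L);
        text.put("ell", 1L);
        text.put("llo", 1L);

        // Two empty profiles: both are equally far from the text.
        Map<String, Map<String, Long>> profiles = new LinkedHashMap<>();
        profiles.put("de", new HashMap<>());
        profiles.put("en", new HashMap<>());
        System.out.println(closest(text, profiles)); // prints "de": a tie, first entry wins

        // Once a profile carries ngram data, it can actually match.
        profiles.get("en").putAll(text);
        System.out.println(closest(text, profiles)); // prints "en"
    }
}
```

Note that if the profiles live in a HashMap, "first" means first in the map's iteration order, which need not match the order of the original array.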
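See the API documentation of LanguageProfile.
Get the ngram files for the languages you need from http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/language/
Then initialize the profiles like this (assuming the .ngp files are in the current working directory, which is where the code below looks for them):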
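    // Imports needed at the top of the file:
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.tika.language.LanguageIdentifier;
    import org.apache.tika.language.LanguageProfile;

    // Inside a method:
    String[] languages = { "de", "en" };
    Map<String, LanguageProfile> languageMaps = new HashMap<String, LanguageProfile>();
    LanguageIdentifier.clearProfiles();
    for (String language : languages) {
        LanguageProfile profile = new LanguageProfile();
        // try-with-resources closes the reader and the underlying file stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("./" + language + ".ngp"), StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            while (line != null) {
                // Skip blank lines and '#' comment lines
                if (line.length() > 0 && !line.startsWith("#")) {
                    // Each data line has the form "<ngram> <count>"
                    int space = line.indexOf(' ');
                    profile.add(line.substring(0, space), Long.parseLong(line.substring(space + 1)));
                }
                line = reader.readLine();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        languageMaps.put(language, profile);
    }
    LanguageIdentifier.initProfiles(languageMaps);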
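The loop above assumes that every data line contains a space. If you want to check the parsing step on its own, without files or Tika on the classpath, it can be pulled out into a small helper. This is only a sketch: `NgpParser` and `parse` are hypothetical names, and the sample content is made up to match the "<ngram> <count>" layout that the loop above reads.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class NgpParser {

    // Hypothetical helper: reads "<ngram> <count>" lines, skipping blank lines
    // and '#' comments, and rejects lines that have no space separator.
    static Map<String, Long> parse(Reader source) throws IOException {
        Map<String, Long> counts = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                int space = line.indexOf(' ');
                if (space < 0) {
                    throw new IOException("Malformed line: " + line);
                }
                counts.put(line.substring(0, space),
                        Long.parseLong(line.substring(space + 1).trim()));
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content, not taken from a real Tika profile.
        String sample = "# sample profile\n_de 42\nder 17\n\n";
        System.out.println(parse(new StringReader(sample))); // prints {_de=42, der=17}
    }
}
```

Each entry of the returned map can then be passed to `profile.add(ngram, count)` exactly as in the loop above.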