Question

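I've just started using the Chinese Word Segmenter and want to use it in my Android app (mainly to parse Tatoeba example sentences). I'm not sure where to start, and I'm looking for documentation and/or examples of using it on Android. Also, I imported the .jar as a library in Android Studio, but I can't attach the sources and Javadoc to it; the main problem is that the library doesn't show up under External Libraries in the Project view. Some good starting questions for me would be: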

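Which classes do I need to use to segment text?
How does the segmenter handle English names?
Is there a documentation page other than this one? (I need documentation for using it from Java, not just as a command-line tool.)
Are there any examples of using the segmenter on Android?
Do I also need the CoreNLP library?
Is there a simpler alternative to the Stanford segmenter?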

Sorry for such a basic question, but right now I really don't understand how to use it.

Answer 1

You need a jar with the code on your CLASSPATH, and you also need a jar that contains these files (all of them can be found in the Chinese models jar):

edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
edu/stanford/nlp/models/segmenter/chinese/ctb.gz
StanfordCoreNLP-chinese.properties

To get the code, you can include this jar from the main distribution:

stanford-corenlp-3.8.0.jar

The files listed above are in the Chinese models jar, which can be downloaded here: https://stanfordnlp.github.io/CoreNLP/download.html
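Before building the pipeline, it can help to confirm that the three resources listed above are actually visible to the class loader. The following is only a minimal sketch using the standard library; the class name ModelResourceCheck and the method findMissing are made up for illustration and are not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ModelResourceCheck {

  // Resource paths quoted in the answer above.
  static final List<String> REQUIRED = Arrays.asList(
      "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz",
      "edu/stanford/nlp/models/segmenter/chinese/ctb.gz",
      "StanfordCoreNLP-chinese.properties");

  // Returns the subset of the given resource names that the class loader cannot find.
  static List<String> findMissing(List<String> names) {
    ClassLoader loader = ModelResourceCheck.class.getClassLoader();
    List<String> missing = new ArrayList<>();
    for (String name : names) {
      if (loader.getResource(name) == null) {
        missing.add(name);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    List<String> missing = findMissing(REQUIRED);
    if (missing.isEmpty()) {
      System.out.println("All segmenter resources found on the classpath.");
    } else {
      System.out.println("Missing from the classpath: " + missing);
    }
  }
}
```

If anything is reported as missing, the models jar is probably not on the classpath, or the files were dropped while trimming it.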

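If you want to integrate this with an Android app, you will have to build some smaller jars, because there are very strict size requirements. I suggest removing most of the code that isn't needed just to run the segmenter.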

If you use stanford-corenlp-3.8.0.jar, here is some sample code:

package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class PipelineExample {

  public static void main(String[] args) {
    // set up pipeline properties
    Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-chinese.properties");
    props.setProperty("annotators", "tokenize,ssplit");
    // set up Stanford CoreNLP pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // build an annotation for the text to segment
    Annotation annotation = new Annotation("...Chinese text to segment...");
    // run the pipeline; then print each token of each sentence
    pipeline.annotate(annotation);
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token);
      }
    }
  }
}

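You will need to do some work to shrink the jars to the bare minimum; ultimately this means removing a lot of classes and checking that everything still runs.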
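As a starting point for that trimming, it may be useful to see which packages take up the most space in a jar. The sketch below is not from the answer; it is an assumed helper (JarPackageSizes and sizeByPackage are hypothetical names) that uses only java.util.zip to total the uncompressed entry sizes per package directory.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JarPackageSizes {

  // Sums the uncompressed size of every file entry, grouped by its directory (package) path.
  static Map<String, Long> sizeByPackage(String jarPath) throws IOException {
    Map<String, Long> totals = new TreeMap<>();
    try (ZipFile jar = new ZipFile(jarPath)) {
      Enumeration<? extends ZipEntry> entries = jar.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        if (entry.isDirectory()) {
          continue;
        }
        String name = entry.getName();
        int slash = name.lastIndexOf('/');
        String pkg = slash < 0 ? "(root)" : name.substring(0, slash);
        // getSize() returns -1 when the size is unknown; count that as 0.
        long size = Math.max(entry.getSize(), 0);
        totals.merge(pkg, size, Long::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: java JarPackageSizes <path-to-jar>");
      return;
    }
    // Print "bytes <TAB> package" for each package found in the jar.
    for (Map.Entry<String, Long> e : sizeByPackage(args[0]).entrySet()) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
  }
}
```

Running it against stanford-corenlp-3.8.0.jar only shows where the bulk is; which packages can actually be removed still has to be found by deleting classes and re-running the segmenter.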

You can also download the standalone segmenter, which runs the same process. More information is here:

https://nlp.stanford.edu/software/segmenter.html

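Using the standalone segmenter distribution may be easier. It includes a demo called SegDemo.java that shows Java API usage for that case. If you use the classes from the standalone segmenter, the sample code given above will not work.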

Stanford Chinese Word Segmenter in Android Studio