BioNLP Stanford - tokenization

Time: 2016-10-05 10:23:34

Tags: java nlp stanford-nlp

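I am trying to tokenize biomedical text, so I decided to use http://nlp.stanford.edu/software/eventparser.shtml. I used the standalone program RunBioNLPTokenizer, which does what I want.

Now I want to write my own program that uses the Stanford libraries, so I read the code of RunBioNLPTokenizer, shown below.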

package edu.stanford.nlp.ie.machinereading.domains.bionlp;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.GenericDataSetReader;
import edu.stanford.nlp.ie.machinereading.msteventextractor.DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.EpigeneticsDataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.GENIA11DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.InfectiousDiseasesDataSet;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

/**
 * Standalone program to run our BioNLP tokenizer and save its output
 */
public class RunBioNLPTokenizer extends GenericDataSetReader {

  public static void main(String[] args) throws IOException {
    Properties props = StringUtils.argsToProperties(args);
    String basePath = props.getProperty("base.directory", "/u/nlp/data/bioNLP/2011/originals/");

    DataSet dataset = new GENIA11DataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "genia/training");
    runTokenizerForDirectory(dataset, basePath + "genia/development");
    runTokenizerForDirectory(dataset, basePath + "genia/testing");

    dataset = new EpigeneticsDataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "epi/training");
    runTokenizerForDirectory(dataset, basePath + "epi/development");
    runTokenizerForDirectory(dataset, basePath + "epi/testing");

    dataset = new InfectiousDiseasesDataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "infect/training");
    runTokenizerForDirectory(dataset, basePath + "infect/development");
    runTokenizerForDirectory(dataset, basePath + "infect/testing");
  }

  private static void runTokenizerForDirectory(DataSet dataset, String path) throws IOException {
    System.out.println("Input directory: " + path);
    BioNLPFormatReader reader = new BioNLPFormatReader();    
    for (File rawFile : reader.getRawFiles(path)) {
      System.out.println("Input filename: " + rawFile.getName());
      String rawText = IOUtils.slurpFile(rawFile);

      String docId = rawFile.getName().replace("." + BioNLPFormatReader.TEXT_EXTENSION, "");
      String parentPath = rawFile.getParent();

      runTokenizer(dataset.getFilesystemInformation().getTokenizedFilename(parentPath, docId), rawText);
    }
  }

  private static void runTokenizer(String tokenizedFilename, String text) {
    System.out.println("Tokenized filename: " + tokenizedFilename);
    Collection<String> sentences = BioNLPFormatReader.splitSentences(text);

    PrintStream os = null;
    try {
      os = new PrintStream(new FileOutputStream(tokenizedFilename));
    } catch (IOException e) {
      System.err.println("ERROR: cannot save online tokenization to " + tokenizedFilename);
      e.printStackTrace();
      System.exit(1);
    }

    for (String sentence : sentences) {
      BioNLPFormatReader.BioNLPTokenizer tokenizer = new BioNLPFormatReader.BioNLPTokenizer(sentence);
      List<CoreLabel> tokens = tokenizer.tokenize();
      for (CoreLabel l : tokens) {
        os.print(l.word() + " ");
      }
      os.println();
    }
    os.close();
  }
}

I wrote the code below. I managed to split the text into sentences, but I cannot use BioNLPTokenizer the way it is used in RunBioNLPTokenizer.

public static void main(String[] args) throws Exception {
  // TODO code application logic here
  Collection<String> c = BioNLPFormatReader.splitSentences("..");
  for (String sentence : c) {
    System.out.println(sentence);
    BioNLPFormatReader.BioNLPTokenizer x = new BioNLPFormatReader.BioNLPTokenizer(sentence);
  }
} 

I got this error:

Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader.BioNLPTokenizer has protected access in edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader

My question is: how can I tokenize biomedical sentences with the Stanford libraries without using RunBioNLPTokenizer?

1 answer:

Answer 0: (score: 0)

Unfortunately, we made BioNLPTokenizer a protected inner class, so you will need to edit the source and change its access to public.
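If editing the source is inconvenient, one possible workaround follows from Java's access rules: a protected member is also visible to every class in the same package. That is presumably why RunBioNLPTokenizer compiles, since it declares package edu.stanford.nlp.ie.machinereading.domains.bionlp. The sketch below illustrates the rule with hypothetical stand-in classes (StandInReader, StandInTokenizer, SamePackageTokenizerDemo are invented for this example and are not part of the Stanford library), so that it runs without the Stanford jar. The stand-in tokenizer only splits on whitespace and does not reproduce the real tokenizer's rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for BioNLPFormatReader: an outer class whose nested
// tokenizer is declared protected, as described in the answer.
class StandInReader {
  protected static class StandInTokenizer {
    private final String sentence;

    public StandInTokenizer(String sentence) {
      this.sentence = sentence;
    }

    // Whitespace split only; NOT the real BioNLP tokenization rules.
    public List<String> tokenize() {
      String trimmed = sentence.trim();
      if (trimmed.isEmpty()) {
        return new ArrayList<>();
      }
      return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }
  }
}

// Declared in the same package as StandInReader (here, the default package),
// so the protected nested class is accessible without any subclassing.
class SamePackageTokenizerDemo {
  static List<String> tokenize(String sentence) {
    return new StandInReader.StandInTokenizer(sentence).tokenize();
  }

  public static void main(String[] args) {
    System.out.println(tokenize("IL-2 gene expression requires NF-kappa B"));
  }
}
```

Applied to the real library, this would mean starting your own source file with the same package declaration as RunBioNLPTokenizer. Treat that as an assumption to verify: it may not work if the jar is sealed or loaded as a named module, in which case changing the access modifier in the source remains the reliable route.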

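Note that BioNLPTokenizer may not be the most general-purpose biomedical sentence tokenizer - I would check the output to make sure it is reasonable. We developed it heavily against the BioNLP 2009/2011 shared tasks.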
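One simple way to check the output is to confirm that the tokens, joined together, still contain every non-whitespace character of the input. The sketch below does only that. The class and method names (TokenizationCheck, preservesCharacters) are hypothetical, and the token list in main is written by hand for illustration; it is not actual BioNLPTokenizer output.

```java
import java.util.Arrays;
import java.util.List;

class TokenizationCheck {
  // Returns true when the tokens, concatenated with whitespace removed,
  // equal the sentence with whitespace removed.
  static boolean preservesCharacters(String sentence, List<String> tokens) {
    String expected = sentence.replaceAll("\\s+", "");
    StringBuilder actual = new StringBuilder();
    for (String token : tokens) {
      actual.append(token.replaceAll("\\s+", ""));
    }
    return expected.equals(actual.toString());
  }

  public static void main(String[] args) {
    String sentence = "IL-2 gene expression requires NF-kappa B.";
    // Hand-written example tokens, not produced by any Stanford class.
    List<String> tokens = Arrays.asList(
        "IL", "-", "2", "gene", "expression", "requires",
        "NF", "-", "kappa", "B", ".");
    System.out.println(preservesCharacters(sentence, tokens));
  }
}
```

A failed check is not necessarily a bug, because some tokenizers deliberately normalize characters (brackets or quotes, for example). It only flags sentences worth inspecting by eye.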