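I am trying to tokenize biomedical text, so I decided to use http://nlp.stanford.edu/software/eventparser.shtml. I used the standalone program RunBioNLPTokenizer, which does what I want.
Now I want to write my own program that uses the Stanford libraries, so I read the code of RunBioNLPTokenizer, shown below.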
package edu.stanford.nlp.ie.machinereading.domains.bionlp;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.GenericDataSetReader;
import edu.stanford.nlp.ie.machinereading.msteventextractor.DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.EpigeneticsDataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.GENIA11DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.InfectiousDiseasesDataSet;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

/**
 * Standalone program to run our BioNLP tokenizer and save its output
 */
public class RunBioNLPTokenizer extends GenericDataSetReader {
  public static void main(String[] args) throws IOException {
    Properties props = StringUtils.argsToProperties(args);
    String basePath = props.getProperty("base.directory", "/u/nlp/data/bioNLP/2011/originals/");

    DataSet dataset = new GENIA11DataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "genia/training");
    runTokenizerForDirectory(dataset, basePath + "genia/development");
    runTokenizerForDirectory(dataset, basePath + "genia/testing");

    dataset = new EpigeneticsDataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "epi/training");
    runTokenizerForDirectory(dataset, basePath + "epi/development");
    runTokenizerForDirectory(dataset, basePath + "epi/testing");

    dataset = new InfectiousDiseasesDataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "infect/training");
    runTokenizerForDirectory(dataset, basePath + "infect/development");
    runTokenizerForDirectory(dataset, basePath + "infect/testing");
  }

  private static void runTokenizerForDirectory(DataSet dataset, String path) throws IOException {
    System.out.println("Input directory: " + path);
    BioNLPFormatReader reader = new BioNLPFormatReader();
    for (File rawFile : reader.getRawFiles(path)) {
      System.out.println("Input filename: " + rawFile.getName());
      String rawText = IOUtils.slurpFile(rawFile);
      String docId = rawFile.getName().replace("." + BioNLPFormatReader.TEXT_EXTENSION, "");
      String parentPath = rawFile.getParent();
      runTokenizer(dataset.getFilesystemInformation().getTokenizedFilename(parentPath, docId), rawText);
    }
  }

  private static void runTokenizer(String tokenizedFilename, String text) {
    System.out.println("Tokenized filename: " + tokenizedFilename);
    Collection<String> sentences = BioNLPFormatReader.splitSentences(text);
    PrintStream os = null;
    try {
      os = new PrintStream(new FileOutputStream(tokenizedFilename));
    } catch (IOException e) {
      System.err.println("ERROR: cannot save online tokenization to " + tokenizedFilename);
      e.printStackTrace();
      System.exit(1);
    }
    for (String sentence : sentences) {
      BioNLPFormatReader.BioNLPTokenizer tokenizer = new BioNLPFormatReader.BioNLPTokenizer(sentence);
      List<CoreLabel> tokens = tokenizer.tokenize();
      for (CoreLabel l : tokens) {
        os.print(l.word() + " ");
      }
      os.println();
    }
    os.close();
  }
}
I wrote the code below. I managed to split the text into sentences, but I cannot use BioNLPTokenizer the way it is used in RunBioNLPTokenizer.
public static void main(String[] args) throws Exception {
  // TODO code application logic here
  Collection<String> c = BioNLPFormatReader.splitSentences("..");
  for (String sentence : c) {
    System.out.println(sentence);
    BioNLPFormatReader.BioNLPTokenizer x = new BioNLPFormatReader.BioNLPTokenizer(sentence);
  }
}
I got this error:
Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader.BioNLPTokenizer has protected access in edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader
My question is: how can I tokenize biomedical sentences with the Stanford libraries without using RunBioNLPTokenizer?
Answer 0 (score: 0)
Unfortunately, we made BioNLPTokenizer a protected inner class, so you will need to edit the source and change its access to public.
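If editing and rebuilding the source is not practical, reflection is one possible workaround; the answer above does not mention it, so treat the following as a rough sketch only. It uses a hypothetical stand-in class (StandInReader.StandInTokenizer, invented for this example) so that it runs without the Stanford jar. For the real library the class name would presumably be "edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader$BioNLPTokenizer", assuming a static nested class with a String constructor and a no-argument tokenize() method, as its usage in RunBioNLPTokenizer suggests; none of that has been checked against the actual source.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ProtectedTokenizerDemo {

    // Hypothetical stand-in for BioNLPFormatReader; NOT the Stanford class.
    static class StandInReader {
        // Mirrors the assumed shape: protected static nested class,
        // String constructor, no-argument tokenize().
        protected static class StandInTokenizer {
            private final String text;

            protected StandInTokenizer(String text) {
                this.text = text;
            }

            // Toy whitespace tokenizer, only here to make the sketch runnable.
            public List<String> tokenize() {
                List<String> tokens = new ArrayList<>();
                for (String t : text.trim().split("\\s+")) {
                    if (!t.isEmpty()) {
                        tokens.add(t);
                    }
                }
                return tokens;
            }
        }
    }

    // Binary name of the stand-in; swap in the real class name at your own risk.
    public static final String STAND_IN_CLASS =
            StandInReader.StandInTokenizer.class.getName();

    // Instantiates the named class through its (String) constructor and
    // calls tokenize() on it, bypassing compile-time access checks.
    public static Object tokenizeReflectively(String className, String sentence) {
        try {
            Class<?> cls = Class.forName(className);
            Constructor<?> ctor = cls.getDeclaredConstructor(String.class);
            ctor.setAccessible(true);
            Object tokenizer = ctor.newInstance(sentence);
            Method tokenize = cls.getDeclaredMethod("tokenize");
            tokenize.setAccessible(true);
            return tokenize.invoke(tokenizer);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Reflective tokenization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizeReflectively(STAND_IN_CLASS, "IL-2 activates NF-kappaB"));
    }
}
```

Another option worth trying: RunBioNLPTokenizer itself is declared in package edu.stanford.nlp.ie.machinereading.domains.bionlp, and Java's protected access also covers classes in the same package, so a class of your own declared in that package should be able to use the tokenizer directly, provided the jar is not sealed.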
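Note that BioNLPTokenizer may not be the most general-purpose biomedical sentence tokenizer. I would check the output to make sure it is reasonable. We developed it heavily against the BioNLP 2009/2011 shared tasks.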