I'm trying to build a sentence splitter that reads a document and predicts the right places to break sentences without splitting on unimportant periods such as "Dr." or ".NET", so I've been experimenting with CoreNLP.
Once I realized the PCFG parser was too slow (and essentially the bottleneck of my whole job), I tried switching to shift-reduce parsing (which, according to the CoreNLP website, is considerably faster).
However, the SRParser runs very slowly and I have no idea why (the PCFG pipeline handles about 1000 sentences per second, while the SRParser manages about 100).
Here is the code for both. One thing worth noting is that each "document" contains only about 10-20 sentences, so they are quite small:
PCFG parser:
import java.util.Properties
import scala.collection.JavaConversions._ // implicit conversions so .toList works on the Java list below
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation

class StanfordPCFGParser {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  var i = 0
  val time = java.lang.System.currentTimeMillis()

  def parseSentence(doc: String): List[String] = {
    val tokens = new Annotation(doc)
    pipeline.annotate(tokens)
    val sentences = tokens.get(classOf[SentencesAnnotation]).toList
    sentences.foreach { s =>
      if (i % 1000 == 0) println("parsed " + i + " in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i = i + 1
    }
    sentences.map(_.toString)
  }
}
Shift-Reduce Parser:
class StanfordShiftReduceParser {
  val p = new Properties()
  p.put("annotators", "tokenize, ssplit, pos, parse, lemma")
  p.put("parse.model", "englishSR.ser.gz")
  val corenlp = new StanfordCoreNLP(p)
  var i = 0
  val time = java.lang.System.currentTimeMillis()

  def parseSentences(text: String): List[String] = {
    val annotation = new Annotation(text)
    corenlp.annotate(annotation)
    val sentences = annotation.get(classOf[SentencesAnnotation]).toList
    sentences.foreach { s =>
      if (i % 1000 == 0) println("parsed " + i + " in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i = i + 1
    }
    sentences.map(_.toString)
  }
}
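(A side note, in case it matters for your setup: parse.model above points at englishSR.ser.gz in the working directory, which the log below shows loading fine. If you instead keep the separate shift-reduce models jar on the classpath, the model is usually referenced by its classpath location; the path below is the one from the standard SR models distribution, so treat it as an assumption to verify against your jar.)

// Hypothetical alternative, assuming the standard SR models jar is on the classpath:
p.put("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz")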
Here is the code I used for timing:
val originalParser = new StanfordPCFGParser
println("starting PCFG")
var time = getTime
sentences.foreach(originalParser.parseSentence)
time = getTime - time
println("PCFG parser took " + time.asInstanceOf[Double] / 1000 + " seconds for 1000 documents to " + originalParser.i + " sentences")

val srParser = new StanfordShiftReduceParser
println("starting SRParse")
time = getTime
sentences.foreach(srParser.parseSentences)
time = getTime - time
println("SR parser took " + time.asInstanceOf[Double] / 1000 + " seconds for 1000 documents to " + srParser.i + " sentences")
This gives me the following output (I've stripped out the "Untokenizable" warnings, which occur because of a questionable data source):
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... starting PCFG
done [0.6 sec].
Adding annotator lemma
parsed 0 in 0 seconds
parsed 1000 in 1 seconds
parsed 2000 in 2 seconds
parsed 3000 in 3 seconds
parsed 4000 in 5 seconds
parsed 5000 in 5 seconds
parsed 6000 in 6 seconds
parsed 7000 in 7 seconds
parsed 8000 in 8 seconds
parsed 9000 in 9 seconds
PCFG parser took 10.158 seconds for 1000 documents to 9558 sentences
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator parse
Loading parser from serialized file englishSR.ser.gz ... done [8.3 sec].
starting SRParse
Adding annotator lemma
parsed 0 in 0 seconds
parsed 1000 in 17 seconds
parsed 2000 in 30 seconds
parsed 3000 in 43 seconds
parsed 4000 in 56 seconds
parsed 5000 in 66 seconds
parsed 6000 in 77 seconds
parsed 7000 in 90 seconds
parsed 8000 in 101 seconds
parsed 9000 in 113 seconds
SR parser took 120.506 seconds for 1000 documents to 9558 sentences
Any help is greatly appreciated!
Answer 0 (score: 2):
If all you need to do is split a piece of text into sentences, the tokenize and ssplit annotators are all you need; the parser is entirely superfluous. (This also explains the timing gap above: your "PCFG" pipeline never actually runs a parser, since its annotator list contains no parse annotator, while the SR pipeline does a full parse of every sentence.) So:
props.put("annotators", "tokenize, ssplit")
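For completeness, a minimal sketch of such a pipeline in the same Scala style as the question's code (the class and method names here are illustrative, not from the original post):

import java.util.Properties
import scala.collection.JavaConversions._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation

class SentenceSplitter {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit") // no tagger, no parser
  val pipeline = new StanfordCoreNLP(props)

  // Returns the sentences of doc as plain strings.
  def split(doc: String): List[String] = {
    val annotation = new Annotation(doc)
    pipeline.annotate(annotation)
    annotation.get(classOf[SentencesAnnotation]).toList.map(_.toString)
  }
}

This keeps only the tokenizer and the rule-based sentence splitter, so throughput should be far higher than with either parsing pipeline.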