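I took this sentence from the Wall Street Journal and ran it through the Stanford POS tagger. Strangely, the tagger changed "theatre" into "theater".
Command: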
java -classpath stanford-postagger-2015-12-09/stanford-postagger-3.6.0.jar:stanford-postagger-2015-12-09/lib/slf4j-simple.jar:stanford-postagger-2015-12-09/lib/slf4j-api.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props stanford-postagger-2015-12-09/penn-treebank.props -model /home/minhle/redep/output/dep/penntree.jackknife/jackknife-04.model -testFile format=TREES,test.tree
Properties file:
## adopted english-bidirectional-distsim.tagger.props
## tagger training invoked at Tue Feb 25 01:33:39 PST 2014 with arguments:
arch = bidirectional5words,naacl2003unknowns,allwordshapes(-1,1),distsim(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1),distsimconjunction(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1)
wordFunction = edu.stanford.nlp.process.AmericanizeFunction
closedClassTags =
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
debug = false
debugPrefix =
tagSeparator = _
encoding = UTF-8
iterations = 100
lang = english
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags =
rareWordMinFeatureThresh = 5
rareWordThresh = 5
search = owlqn2
sgml = false
sigmaSquared = 0.5
regL1 = 0.75
tagInside =
tokenize = true
tokenizerFactory =
tokenizerOptions =
verbose = false
verboseResults = true
veryCommonWordThresh = 250
xmlInput =
outputFile =
outputFormat = slashTags
outputFormatOptions =
nthreads = 4
Input sentence:
( (SINV (`` ``) (S-TPC-2 (PP (IN without) (NP (DT some) (JJ unexpected) (`` ``) (FW coup) (FW de) (FW theatre) ('' ''))) (, ,) (NP-SBJ (PRP I)) (VP (VBP do) (RB n't) (VP (VB see) (SBAR (WHNP-1 (WP what)) (S (NP-SBJ-1 (-NONE- *T*)) (VP (MD will) (VP (VB block) (NP (DT the) (NNP Paribas) (NN bid))))))))) (, ,) ('' '') (VP (VBD said) (S-2 (-NONE- *T*))) (NP-SBJ (NP (NNP Philippe) (NNP de) (NNP Cholet)) (, ,) (NP (NP (NN analyst)) (PP-LOC (IN at) (NP (NP (DT the) (NN brokerage)) (NP (NNP Cholet) (HYPH -) (NNP Dupont) (CC &) (NNP Cie)))))) (. .)) )
Output:
``_`` without_IN some_DT unexpected_JJ ``_`` coup_NN de_IN theater_NN ''_'' ,_, I_PRP do_VBP n't_RB see_VB what_WP will_MD block_VB the_DT Paribas_NNP bid_NN ,_, ''_'' said_VBD Philippe_NNP de_IN Cholet_NNP ,_, analyst_NN at_IN the_DT brokerage_NN Cholet_NNP -_HYPH Dupont_NNP &_CC Cie_NNP ._.
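Answer 0 (score: 2)
As far as I understand, the Stanford POS tagger is trained on American English data. At run time, the input is "Americanized" so that the tagger recognizes it correctly. See the following line in your properties file:
wordFunction = edu.stanford.nlp.process.AmericanizeFunction
If you access CoreNLP programmatically, you can retrieve the pre-Americanization form via CoreLabel.originalText(). You can also disable the wordFunction, but you may see some incorrect output as a result.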
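If you only use the command-line tagger, as in the question, one possible workaround is to put the original spellings back into the tagged output afterwards. The following is a minimal sketch of that idea using only the Java standard library; it is not part of Stanford's code. The class name `RestoreOriginalTokens` and its `restore` method are hypothetical, and the sketch assumes the tagged line has exactly one `word_TAG` item per original token, in the same order, with trace tokens such as `*T*` already left out of the token list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Stanford tagger.
public class RestoreOriginalTokens {

    /**
     * Replaces the word part of each "word<sep>TAG" item in taggedLine with the
     * original token at the same position, keeping the predicted tag.
     * Assumes a one-to-one, in-order alignment between the two sequences.
     */
    public static String restore(List<String> originalTokens,
                                 String taggedLine,
                                 String tagSeparator) {
        String[] tagged = taggedLine.trim().split("\\s+");
        if (tagged.length != originalTokens.size()) {
            throw new IllegalArgumentException(
                "Token count mismatch: " + tagged.length
                + " tagged vs " + originalTokens.size() + " original");
        }
        List<String> restored = new ArrayList<>();
        for (int i = 0; i < tagged.length; i++) {
            // The tag follows the LAST separator, so words that contain
            // the separator themselves are still handled.
            int sep = tagged[i].lastIndexOf(tagSeparator);
            if (sep < 0) {
                throw new IllegalArgumentException(
                    "No tag separator in item: " + tagged[i]);
            }
            String tag = tagged[i].substring(sep + tagSeparator.length());
            restored.add(originalTokens.get(i) + tagSeparator + tag);
        }
        return String.join(" ", restored);
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("coup", "de", "theatre");
        // Prints: coup_NN de_IN theatre_NN
        System.out.println(restore(original, "coup_NN de_IN theater_NN", "_"));
    }
}
```

The separator is passed in so that it can match the `tagSeparator = _` setting from the properties file. The length check fails loudly if the alignment assumption does not hold, so misaligned tags are never produced silently.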