Why does the Stanford POS tagger modify the input sentence?

Asked: 2015-12-16 12:23:35

Tags: stanford-nlp

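I took this sentence from the Wall Street Journal and ran it through the Stanford POS tagger. Strangely, the tagger changed "theatre" into "theater".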

Command:

java -classpath stanford-postagger-2015-12-09/stanford-postagger-3.6.0.jar:stanford-postagger-2015-12-09/lib/slf4j-simple.jar:stanford-postagger-2015-12-09/lib/slf4j-api.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props stanford-postagger-2015-12-09/penn-treebank.props -model /home/minhle/redep/output/dep/penntree.jackknife/jackknife-04.model -testFile format=TREES,test.tree

Properties file:

## adopted english-bidirectional-distsim.tagger.props
## tagger training invoked at Tue Feb 25 01:33:39 PST 2014 with arguments:
                    arch = bidirectional5words,naacl2003unknowns,allwordshapes(-1,1),distsim(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1),distsimconjunction(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1)
            wordFunction = edu.stanford.nlp.process.AmericanizeFunction
         closedClassTags =
 closedClassTagThreshold = 40
 curWordMinFeatureThresh = 2
                   debug = false
             debugPrefix =
            tagSeparator = _
                encoding = UTF-8
              iterations = 100
                    lang = english
    learnClosedClassTags = false
        minFeatureThresh = 2
           openClassTags =
rareWordMinFeatureThresh = 5
          rareWordThresh = 5
                  search = owlqn2
                    sgml = false
            sigmaSquared = 0.5
                   regL1 = 0.75
               tagInside =
                tokenize = true
        tokenizerFactory =
        tokenizerOptions =
                 verbose = false
          verboseResults = true
    veryCommonWordThresh = 250
                xmlInput =
              outputFile =
            outputFormat = slashTags
     outputFormatOptions =
                nthreads = 4

Input sentence:

  

( (SINV (`` ``) (S-TPC-2 (PP (IN without) (NP (DT some) (JJ unexpected) (`` ``) (FW coup) (FW de) (FW theatre) ('' ''))) (, ,) (NP-SBJ (PRP I)) (VP (VBP do) (RB n't) (VP (VB see) (SBAR (WHNP-1 (WP what)) (S (NP-SBJ-1 (-NONE- *T*)) (VP (MD will) (VP (VB block) (NP (DT the) (NNP Paribas) (NN bid))))))))) (, ,) ('' '') (VP (VBD said) (S-2 (-NONE- *T*))) (NP-SBJ (NP (NNP Philippe) (NNP de) (NNP Cholet)) (, ,) (NP (NP (NN analyst)) (PP-LOC (IN at) (NP (NP (DT the) (NN brokerage)) (NP (NNP Cholet) (HYPH -) (NNP Dupont) (CC &) (NNP Cie)))))) (. .)) )

Output:

  

``_`` without_IN some_DT unexpected_JJ ``_`` coup_NN de_IN theater_NN ''_'' ,_, I_PRP do_VBP n't_RB see_VB what_WP will_MD block_VB the_DT Paribas_NNP bid_NN ,_, ''_'' said_VBD Philippe_NNP de_IN Cholet_NNP ,_, analyst_NN at_IN the_DT brokerage_NN Cholet_NNP -_HYPH Dupont_NNP &_CC Cie_NNP ._.

1 Answer:

Answer 0 (score: 2)

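As I understand it, the Stanford POS tagger is trained on American English data. At runtime we "Americanize" the input so that the tagger recognizes it correctly. See the following line in your properties file:

wordFunction = edu.stanford.nlp.process.AmericanizeFunction

If you access CoreNLP programmatically, you can retrieve the pre-Americanized form via CoreLabel.originalText. You could also disable the wordFunction setting, but you may then see some incorrect output, because the model was trained on Americanized words.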
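To make the mechanism concrete, here is a minimal, self-contained sketch of what a word function of this kind does: it maps each token to a normalized spelling before tagging while keeping the original surface form alongside it. This is a toy illustration only, not Stanford's implementation; the class name AmericanizeSketch, the Token type, and the two-entry spelling table are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class AmericanizeSketch {
    // Hypothetical toy spelling table; a real normalizer covers far more words.
    private static final Map<String, String> BRITISH_TO_AMERICAN = new HashMap<>();
    static {
        BRITISH_TO_AMERICAN.put("theatre", "theater");
        BRITISH_TO_AMERICAN.put("colour", "color");
    }

    /** A token that keeps its original surface form next to the normalized one. */
    public static final class Token {
        public final String originalText; // what appeared in the input
        public final String word;         // what a tagger would actually see

        Token(String originalText, String word) {
            this.originalText = originalText;
            this.word = word;
        }
    }

    /** Normalizes one raw token; words missing from the table pass through unchanged. */
    public static Token normalize(String raw) {
        String mapped = BRITISH_TO_AMERICAN.getOrDefault(raw, raw);
        return new Token(raw, mapped);
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"coup", "de", "theatre"}) {
            Token t = normalize(raw);
            System.out.println(t.word + " (original: " + t.originalText + ")");
        }
    }
}
```

The point of carrying both fields is the same as in the answer above: the tagger works on the normalized word, and the caller can still recover the spelling that was in the input.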