
Time: 2013-06-09 19:34:22

Tags: python regex nlp part-of-speech


  1. you_PRP mean_VBP we_PRP should_MD kick_VB them_PRP out_RP ._.
  2. don_VB 't_NNP take_VB it_PRP off_RP until_IN I_PRP say_VBP so_RB ._.
  3. please_VB help_VB the_DT man_NN out_RP ._.
  4. shut_VBZ it_PRP down_RP !_.
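I want to move every particle (in the examples: out_RP, off_RP, out_RP, down_RP) next to the nearest preceding verb, i.e. the verb that combines with the particle to form a phrasal verb. This is what the lines should look like after the word order has been changed: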

    1. you_PRP mean_VBP we_PRP should_MD kick_VB out_RP them_PRP ._.
    2. don_VB 't_NNP take_VB off_RP it_PRP until_IN I_PRP say_VBP so_RB ._.
    3. please_VB help_VB out_RP the_DT man_NN ._.
    4. shut_VBZ down_RP it_PRP !_.
So far I have tried to solve the problem with Python and regular expressions, using re.findall:

      import re 
      print wordorder1


      (The tags used are taken from the Penn Treebank tagset (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html); x stands for an optional character so that all verb forms are covered, and * is a wildcard.)

      1. *_VBx + *_DT + *_NN + *_RP
      2. *_VBx + *_DT + *_NNS + *_RP
      3. *_VBx + *_DT + *_JJ + *_NN + *_RP
      4. *_VBx + *_DT + *_JJ + *_NNS + *_RP

      5. *_VBx + *_PRP$ + *_NN + *_RP

      6. *_VBx + *_PRP$ + *_NNS + *_RP
      7. *_VBx + *_PRP$ + *_JJ + *_NN + *_RP
      8. *_VBx + *_PRP$ + *_JJ + *_NNS + *_RP

      9. *_VBx + *_NNP + *_RP

      10. *_VBx + *_JJ + *_NNP + *_RP

      11. *_VBx + *_NNPS + *_RP

      12. *_VBx + *_PRP + *_RP
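
As a rough illustration of what such tag patterns could look like as one regular expression, here is a minimal sketch in Python 3. It covers only the determiner, possessive-pronoun, proper-noun and personal-pronoun shapes from the list, and it assumes tokens are separated by single spaces. The names TOKEN, OBJECT, PHRASAL_VERB and move_particles are made up for this example, and re.sub is used instead of re.findall because the goal is to rewrite the line.

```python
import re

TOKEN = r"\S+_"                     # a word followed by the tag separator
VERB = TOKEN + r"VB[DGNPZ]?"        # VB, VBD, VBG, VBN, VBP, VBZ (the "VBx" above)
# DT or PRP$, an optional adjective, then a singular or plural noun
NOUN_PHRASE = TOKEN + r"(?:DT|PRP\$) (?:" + TOKEN + r"JJ )?" + TOKEN + r"NNS?"
# an optional adjective, then a proper noun
PROPER_NOUN = r"(?:" + TOKEN + r"JJ )?" + TOKEN + r"NNP"
PRONOUN = TOKEN + r"PRP"
OBJECT = r"(?:" + "|".join([NOUN_PHRASE, PROPER_NOUN, PRONOUN]) + r")"
PARTICLE = TOKEN + r"RP"

# The lookarounds make sure the match starts and ends on token boundaries.
PHRASAL_VERB = re.compile(
    r"(?<!\S)(?P<verb>" + VERB + r") "
    r"(?P<obj>" + OBJECT + r") "
    r"(?P<particle>" + PARTICLE + r")(?!\S)"
)


def move_particles(line):
    """Rewrite 'verb object particle' as 'verb particle object'."""
    return PHRASAL_VERB.sub(r"\g<verb> \g<particle> \g<obj>", line)


print(move_particles("please_VB help_VB the_DT man_NN out_RP ._."))
```

A pattern like this only fires when the object has exactly one of the listed shapes, so any construction that is not enumerated is left untouched.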

Thanks in advance for your help!

1 Answer:

Answer 0 (score: 3)


# Open both files in one with-statement so the output file is always closed
with open('corpus.txt', 'r') as corpus, \
        open('reordered_corpus.txt', 'w') as reordered_corpus:
    for phrase in corpus:
        phrase = phrase.split()                 # split on whitespace
        vb_index = rp_index = -1                # indices of the verb and the particle
        for i, word_pos in enumerate(phrase):
            pos = word_pos.rsplit('_', 1)[-1]   # POS tag follows the last underscore
            if pos in ('VB', 'VBZ'):            # can add more verb POS tags
                vb_index = i                    # remember the most recent verb
            elif vb_index >= 0 and pos == 'RP': # or more particle POS tags
                rp_index = i
                break                           # found both so can stop
        if vb_index >= 0 and rp_index >= 0:     # move the particle right after the verb
            phrase = phrase[:vb_index+1] + [phrase[rp_index]] + \
                     phrase[vb_index+1:rp_index] + phrase[rp_index+1:]
        reordered_corpus.write(' '.join(phrase) + '\n')


Sample input (corpus.txt):

you_PRP mean_VBP we_PRP should_MD kick_VB them_PRP out_RP ._.
don_VB 't_NNP take_VB it_PRP off_RP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB the_DT man_NN out_RP ._.
shut_VBZ it_PRP down_RP !_.


Resulting output (reordered_corpus.txt):

you_PRP mean_VBP we_PRP should_MD kick_VB out_RP them_PRP ._.
don_VB 't_NNP take_VB off_RP it_PRP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB out_RP the_DT man_NN ._.
shut_VBZ down_RP it_PRP !_.
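
The loop above stops at the first particle on a line and only treats VB and VBZ as verbs. If a line can contain several phrasal verbs, one possible variation is sketched below in Python 3. It is not part of the original answer: the function name move_all_particles and the VERB_TAGS set are assumptions made for illustration, and it assumes that each particle belongs to the most recent verb that has not already received one.

```python
# All Penn Treebank verb tags; trim this set if some forms should be ignored.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def move_all_particles(line, verb_tags=VERB_TAGS, particle_tags=("RP",)):
    """Move every particle directly after the nearest preceding verb."""
    result = []
    last_verb = -1                        # position in result of the latest verb
    for token in line.split():
        tag = token.rsplit("_", 1)[-1]    # POS tag follows the last underscore
        if tag in particle_tags and last_verb >= 0:
            result.insert(last_verb + 1, token)
            last_verb = -1                # this verb now has its particle
        else:
            if tag in verb_tags:
                last_verb = len(result)   # index the verb is about to occupy
            result.append(token)
    return " ".join(result)


print(move_all_particles("pick_VB it_PRP up_RP and_CC put_VB it_PRP down_RP ._."))
# pick_VB up_RP it_PRP and_CC put_VB down_RP it_PRP ._.
```

Building a new list instead of slicing the original keeps the indices valid after each move, which is what allows more than one particle per line to be handled.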