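Perl: How do I split this text to extract the information I need?

Time: 2011-06-08 18:29:02

Tags: perl split

Edited / shortened version

I have two texts that come from two files I have to loop over (you can ignore my variables). Here is a sample of each:

Tagged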

5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD We_PRP have_VBP examined_VBN membrane_NN structure_NN and_CC how_WRB it_PRP is_VBZ used_VBN to_TO perform_VB one_CD membrane_NN function_NN :_: the_DT binding_JJ of_IN one_CD cell_NN to_TO another_DT ._.

Desired output:

5.4 Passive Processes of Membrane Transport 85 We have examined membrane stru....

Parsed

   Parsing [sent. 1 len. 31]:
        nsubj(85-7, Processes-3)
        nn(Transport-6, Membrane-5)
        prep_of(Processes-3, Transport-6)
        nsubj(examined-10, We-8)
        nsubjpass(used-17, it-15)
        xsubj(perform-19, it-15)
        conj_and(examined-10, used-17)
        xcomp(used-17, perform-19)
        dobj(perform-19, function-22)
        prep_of(binding-25, cell-28) <- refer to this for examples below

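Desired output:

  • Sentence number (i.e. sent. 1)
  • Grammatical function (i.e. prep_of)
  • First dependency word (i.e. binding)
  • Second dependency word (i.e. cell)

Question

How do I split/substitute these to get the output I want, so that each word keeps a word boundary at its start and end (a match like =~ /\bword\b/ should still work)?

Thank you for taking the time to read this! Any suggestions are appreciated!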

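1 answer:

Answer 0: (score: 3)

Well, even your revised question is hard for me to understand. Since I could not work out what you wanted, I skipped over the earlier versions of your question, so I figured I would share a better way to explain it. I suggest you drop the background information and reduce the question to something like this: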

@subsentences = ("5.4_CD Passive_NNP Processes_NNP","85_CD We_PRP have_VBP examined_VBN membrane_NN");
foreach my $sub (@subsentences) {
  @final = split(/_\S+/,$sub);
  print join(",",@final)."\n";
}

Expected output: ("5.4", "Passive", "Processes") and ("85", "We", "have", "examined", "membrane").

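Sadly, I can't even tell whether my guess at what you might mean in this example is right (perhaps you meant @subsentence = qw(5.4_CD Passive_NNP Processes_NNP) rather than something else?). Repeat this for each example. Assuming I guessed right, the regex you want in this example is: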

@finalsentence = split(/_\S+(?:\s+|$)/,$subsentences[$j]);

or, equally valid (?):

@finalsentence = grep(s/_\S+//||1,split(/\s+/,$subsentences[$j]));
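To take the guesswork out of that "(?)", here is a small sketch that runs both variants against part of the sample line from the question. The helper names strip_tags_split and strip_tags_grep are made up for this illustration, and it assumes every token has the form word_TAG with whitespace between tokens.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; neither name appears in the question.

# Variant 1: split on the "_TAG" suffix together with the whitespace
# (or end of string) that follows it, so no stray spaces are left behind.
sub strip_tags_split {
    my ($text) = @_;
    return split /_\S+(?:\s+|$)/, $text;
}

# Variant 2: split on whitespace first, then strip "_TAG" from each token.
# The "|| 1" keeps every token, even one where nothing was substituted.
sub strip_tags_grep {
    my ($text) = @_;
    return grep { s/_\S+// || 1 } split /\s+/, $text;
}

my $tagged = '5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP';

my @by_split = strip_tags_split($tagged);
my @by_grep  = strip_tags_grep($tagged);

print join(',', @by_split), "\n";
print join(',', @by_grep),  "\n";
```

Both lines should print 5.4,Passive,Processes,of,Membrane,Transport. A plain split(/_\S+/, ...) without the trailing (?:\s+|$) would leave a leading space on every element after the first, which is why the first variant consumes the whitespace as part of the separator.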

I think we have worked out that the actual question he meant to ask is:

@subs = qw(5.4_CD Passive_NNP Processes_NNP);
Expected output: qw(5.4 Passive Processes)

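If my revised understanding is correct, the following will do what you want:

my @subs = qw(5.4_CD Passive_NNP Processes_NNP);
my @final = @subs;             # work on a copy so @subs keeps its tags
s/_\S+// for @final;           # strip the "_TAG" suffix from each element in place
print join(",",@final)."\n";   # prints 5.4,Passive,Processes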
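None of the above touches the parsed text from the question. A rough sketch for that part might look like the following. It assumes each sentence starts with a "Parsing [sent. N len. M]:" header and each dependency line has the form func(word-N, word-N), as in the sample; the variable names are made up for illustration, and in practice the lines would be read from the file rather than hard-coded.

```perl
use strict;
use warnings;

# A few lines copied from the sample in the question.
my @lines = (
    '   Parsing [sent. 1 len. 31]:',
    '        nsubj(85-7, Processes-3)',
    '        prep_of(binding-25, cell-28) <- refer to this for examples below',
);

my $sent;       # sentence number from the most recent header line
my @records;    # one [sentence, function, first word, second word] per dependency

for my $line (@lines) {
    if ($line =~ /\[sent\.\s*(\d+)/) {
        # Header line: remember which sentence the following lines belong to.
        $sent = $1;
    }
    elsif ($line =~ /^\s*(\w+)\((\S+)-\d+,\s*(\S+)-\d+\)/) {
        # Dependency line: function name, then two "word-index" pairs.
        push @records, [$sent, $1, $2, $3];
    }
}

# One tab-separated line per dependency.
print join("\t", @$_), "\n" for @records;
```

Because the captures stop before the "-N" index, the extracted words are bare words, so a later match such as /\bcell\b/ against them behaves as expected.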