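A more elegant solution for removing items from a batch of files?

Time: 2011-06-22 15:33:22

Tags: regex perl

OK, this is more for my own learning than out of any real need.

My files are formatted like this: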

Loading parser from serialized file ./englishPCFG.ser.gz ... done [2.8 sec].
Parsing file: chpt1_1.txt
Parsing [sent. 1 len. 42]: [1.1, Organisms, Have, Changed, over, Billions, of, Years, 1, Long, before, the, mechanisms, of, biological, evolution, were, understood, ,, some, people, realized, that, organisms, had, changed, over, time, and, that, living, organisms, had, evolved, from, organisms, no, longer, alive, on, Earth, .]
(ROOT
  (S
    (S
      (NP (CD 1.1) (NNS Organisms))
      (VP (VBP Have)
        (VP (VBN Changed)
          (PP (IN over)
            (NP
              (NP (NNS Billions))
              (PP (IN of)
                (NP (NNP Years) (CD 1)))))
          (SBAR
            (ADVP (RB Long))
            (IN before)
            (S
              (NP
                (NP (DT the) (NNS mechanisms))
                (PP (IN of)
                  (NP (JJ biological) (NN evolution))))
              (VP (VBD were)
                (VP (VBN understood))))))))
    (, ,)
    (NP (DT some) (NNS people))
    (VP (VBD realized)
      (SBAR
        (SBAR (IN that)
          (S
            (NP (NNS organisms))
            (VP (VBD had)
              (VP (VBN changed)
                (PP (IN over)
                  (NP (NN time)))))))
        (CC and)
        (SBAR (IN that)
          (S
            (NP (NN living) (NNS organisms))
            (VP (VBD had)
              (VP (VBN evolved)
                (PP (IN from)
                  (NP
                    (NP (NNS organisms))
                    (ADJP
                      (ADVP (RB no) (RBR longer))
                      (JJ alive))))
                (PP (IN on)
                  (NP (NNP Earth)))))))))
    (. .)))

num(Organisms-2, 1.1-1)
nsubj(Changed-4, Organisms-2)
aux(Changed-4, Have-3)
ccomp(realized-22, Changed-4)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
num(Years-8, 1-9)
advmod(understood-18, Long-10)
dep(understood-18, before-11)
det(mechanisms-13, the-12)
nsubjpass(understood-18, mechanisms-13)
amod(evolution-16, biological-15)
prep_of(mechanisms-13, evolution-16)
auxpass(understood-18, were-17)
ccomp(Changed-4, understood-18)
det(people-21, some-20)

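I need to remove all the unimportant dependencies (the last part of the file) and then save the result as a new file. Here is my working code: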

#!/usr/bin/perl
use strict;
use warnings;

##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob("parsed"."*.txt");

foreach my $file (@files) {
    my @newfile;
    open(my $parse_corpus, '<', $file) or die "Cannot open $file: $!";
    while (my $sentences = <$parse_corpus>) {
        #print $sentences, "\n\n";
        # Dependency lines look like "nsubj(Changed-4, Organisms-2)".
        if ($sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/) {
            # Keep only the dependency relations of interest.
            if ($sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/) {
                push (@newfile, $sentences);
            }
        }
        else {
            # Everything else (header lines, parse trees) is kept unchanged.
            push (@newfile, $sentences);
        }
    }
    close($parse_corpus);
    open(my $out, '>', "select$file") or die "Cannot write select$file: $!";
    print $out @newfile;
    close($out) or die "Cannot close select$file: $!";
}

Part of the changed output file:

nsubj(Changed-4, Organisms-2)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
nsubjpass(understood-18, mechanisms-13)
prep_of(mechanisms-13, evolution-16)
nsubj(realized-22, people-21)
nsubj(changed-26, organisms-24)
prep_over(changed-26, time-28)
nsubj(evolved-34, organisms-32)
conj_and(changed-26, evolved-34)
prep_from(evolved-34, organisms-36)
prep_on(evolved-34, Earth-41)

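Is there a better way, or a more elegant/clever solution?

Thanks for your time. This is purely out of interest, so if you don't have the time, please don't feel obliged to help.

1 Answer:

Answer 0: (score: 3)

If I understand your logic, you want to print each line to the outfile by default, unless you encounter a 'sentence' that matches the first condition. When that first condition is met, you only want to write the line to the outfile if the second condition is also true. In that situation I lean toward "if this, next unless that" logic, but that's just me. ;) Here is an example based on your code.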

use strict;
use warnings;
use autodie;

##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob( "parsed" . "*.txt" );

foreach my $file ( @files ) {
    open my $parse_corpus, '<', "$file";
    open my $outfile, '>', "select$file";
    while ( my $sentences = <$parse_corpus> ) {
        if( $sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/ ) {
            next unless $sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/;
        }
        print $outfile $sentences;
    }
}

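I didn't attempt to refactor your regexes. I did find it more pleasing, from an efficiency standpoint, to write the output file line by line while reading the input file. That eliminates the second loop as well as the need for an output array.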
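For what it's worth, the regexes could be tidied up along the following lines. This is only a sketch, assuming the list of relations to keep is exactly the one in the original alternation; the names `@keep`, `$keep_re`, `$dep_re` and `keep_line` are hypothetical and not part of the code above.

```perl
use strict;
use warnings;

# Relation prefixes to keep, copied from the alternation in the original regex.
my @keep = qw(subj obj prep xcomp agent purpcl conj_and);

# Build one precompiled pattern equivalent to subj\w*\(|obj\w*\(|...
my $keep_re = do {
    my $alt = join '|', map { quotemeta } @keep;
    qr/(?:$alt)\w*\(/;
};

# Matches a dependency line such as "nsubj(Changed-4, Organisms-2)".
my $dep_re = qr/\w+\(\S+-\d+,\s\S+-\d+\)/;

# Hypothetical helper: returns 1 if the line belongs in the output file.
sub keep_line {
    my ($line) = @_;
    return 1 unless $line =~ $dep_re;    # not a dependency line: always keep
    return $line =~ $keep_re ? 1 : 0;    # dependency line: keep selected relations only
}

print "kept\n"    if keep_line("nsubj(Changed-4, Organisms-2)\n");
print "dropped\n" unless keep_line("num(Organisms-2, 1.1-1)\n");
```

Inside the read loop this would reduce to `print $outfile $sentences if keep_line($sentences);`, and adding or removing a relation becomes a one-word change to `@keep`.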

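Also, I used the autodie pragma rather than specifying 'or die' after every IO operation. Because I used lexical filehandles, each file closes itself when its handle goes out of scope at the end of the loop iteration. Note that autodie only checks explicit calls to close, so an implicit close is not covered by it; add an explicit close if you want errors at close time to be reported.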
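As far as I know, autodie only intercepts explicit calls to `close`, so if you want a failure to flush the output reported, an explicit `close` is the safer choice. A minimal sketch, assuming a throwaway file name (`autodie_close_demo.txt` is hypothetical and used only for illustration):

```perl
use strict;
use warnings;
use autodie;

# Hypothetical file name, used only for this illustration.
my $path = 'autodie_close_demo.txt';

open my $out, '>', $path;                  # autodie dies if the open fails
print {$out} "nsubj(Changed-4, Organisms-2)\n";
close $out;                                # explicit close: autodie dies if it fails

open my $in, '<', $path;
my $line = <$in>;                          # read back the single line just written
close $in;
unlink $path;                              # clean up the demo file

print $line;
```

In the loop above, that would mean adding `close $outfile;` just before the closing brace of the `foreach` block.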