How to remove certain stop words from parts of a file that begin and end with specific markers

Time: 2012-04-12 20:16:03

Tags: perl word

.I 1
.T
Alice in wonderland
She follows it down a rabbit hole when suddenly 
she falls a long way to a curious hall with many locked doors of all sizes. 
She finds a small key to a door too small for her to fit through.
.B
CACM wolf dog December, 1958
.A
Perlis, A. J.
Samelson,K.
.N
CA581203 JB March 22, 1978  8:28 PM
.X
100 5   1
123 5   1
164 5   1
.I 2
.T
Extraction of Roots by Repeated Subtractions for Digital Computers
the contents of which cause her to shrink too small to reach the key
which she has left on the table.
A cake with "EAT ME" on it causes her to grow.
.B
CACM December, 1958
.A
Sugai, I.
.N
CA581202 JB March 22, 1978  8:29 PM
.X
2   5   2
2   5   2
2   5   2

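The text above contains two documents; each new document starts with .I (followed by a number). I need to remove stop words from the text between .T & .B, .B & .A, .A & .N, and .N & .X, and delete all text between .X and the start of the next document, i.e. the next .I (followed by a number).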

The output should look like:

.I 1
.T
Alice wonderland
follows rabbit hole suddenly 
falls long way curious hall locked doors sizes 
door small fit through
.B
CACM wolf dog December, 1958
.A
Perlis, A. J.
Samelson,K.
.N
CA581203 JB March 22, 1978  8:28 PM
.X
.I 2
.T
Extraction Roots Repeated Subtractions Digital Computers
contents cause shrink
left table
cake with EAT causes grow
.B
CACM December, 1958
.A
Sugai, I.
.N
CA581202 JB March 22, 1978  8:29 PM
.X

In short, I need to remove stop words from the text that appears between .T & .B, .B & .A, .A & .N, and .N & .X.

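1 Answer:

Answer 0: (score: 0)

The first step is to split each block into a suitable data structure. The following script does that. Once you have %segments, you can modify each segment and reassemble the block as needed.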

#!/usr/bin/env perl

use strict; use warnings;
use Data::Dumper;

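# Stop words to remove (extend as needed).
my %stops = map { $_ => 1 } qw(a all of in);

# Reads from the DATA handle: append a __DATA__ line followed by the
# sample input at the end of this script, or pass another filehandle.
run(\*DATA, \%stops);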

sub run {
    my $fh = shift;
    my $stops = shift;

    # Use ".I" as the input record separator so each read returns one document.
    local $/ = '.I';

    my $pat = qr{
        ^[ ] (?<I> [0-9]+) \n
        ^[.] T \n (?<T> .+)
        ^[.] B \n (?<B> .+)
        ^[.] A \n (?<A> .+)
        ^[.] N \n (?<N> .+)
        ^[.] X \n (?<X> .+)
    }xms;

    while (my $chunk = <$fh>) {
        chomp $chunk;          # strip the trailing ".I" separator
        next unless $chunk;    # skip the empty chunk before the first ".I"

        if ($chunk =~ $pat) {
            my %segments = %+;
            print Dumper \%segments;
        }
    }
}
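
As a follow-up to the script above, here is a minimal sketch of the "modify and reassemble" step, assuming %segments has the keys I, T, B, A, N and X produced by the named captures. The helper names remove_stops and rebuild, the stop-word list, and the sample %segments hash are illustrative assumptions, not part of the original answer. The sketch filters only the .T segment: with "a" as a stop word, filtering .A would also strip the initial from "Perlis, A. J.", which the expected output in the question keeps.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stop-word list; replace it with your own.
my %stops = map { $_ => 1 } qw(a all of in she it to the with many her for down when);

# Hypothetical sample of what the parsing step might capture for one document.
my %segments = (
    I => 1,
    T => "Alice in wonderland\nShe follows it down a rabbit hole\n",
    B => "CACM wolf dog December, 1958\n",
    A => "Perlis, A. J.\nSamelson,K.\n",
    N => "CA581203 JB March 22, 1978  8:28 PM\n",
    X => "100 5   1\n123 5   1\n",
);

# Drop stop words from each line of $text, comparing case-insensitively
# and ignoring punctuation around each word.
sub remove_stops {
    my ($text, $stops) = @_;
    my @out;
    for my $line (split /\n/, $text) {
        my @kept = grep {
            (my $bare = lc $_) =~ s/^\W+|\W+$//g;
            !$stops->{$bare};
        } split ' ', $line;
        push @out, join ' ', @kept;
    }
    return join "\n", @out;
}

# Reassemble one document. Segments named in @filter are stop-word
# filtered; the text that followed .X is discarded.
sub rebuild {
    my ($seg, $stops, @filter) = @_;
    my %filter = map { $_ => 1 } @filter;
    my $out = ".I $seg->{I}\n";
    for my $tag (qw(T B A N)) {
        my $body = $seg->{$tag};
        $body =~ s/\n+\z//;    # the captures keep their trailing newline
        $body = remove_stops($body, $stops) if $filter{$tag};
        $out .= ".$tag\n$body\n";
    }
    return $out . ".X\n";
}

print rebuild(\%segments, \%stops, 'T');
```

In the original script, the call would go inside the `if ($chunk =~ $pat)` branch in place of the Dumper line, for example `print rebuild(\%segments, $stops, 'T');`. Pass more tags (such as 'B' or 'N') if those segments should be filtered too.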