Need help executing a Perl tokenizer script

Time: 2015-06-30 11:31:15

Tags: regex perl tokenize

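I'm a Perl amateur. I was recently given a Perl script that takes a text file and strips all formatting, leaving only individual words, each followed by a single space. The problem is that the script doesn't make clear how to supply the input file location. I've set up some code to run it over a whole directory of files, but I haven't been able to get it to execute. I'll post the original code, followed by what I added. Thanks for your help!

Original: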

while(<>) {
    chomp;
    s/\<[^<>]*\>//g;           # eliminate markup
    tr/[A-Z]/[a-z]/;           # downcase

     s/([a-z]+|[^a-z]+)/\1 /g;  # separate letter strings from other types of sequences

    s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

    s/[0-9]+/\#/g;             # map numerical strings to #

    s/\s+/ /g;                 # these three lines clean up white space (so it's always exactly one space between words, no newlines
    s/^\s+//;
    s/\s+$/ /;


    print if(m/\S/);           # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline

My changes:

#!/usr/local/bin/perl

$dirtoget="1999_txt/";
opendir(IMD, $dirtoget) || die("Cannot open directory");
@thefiles= readdir(IMD); #
closedir(IMD);
    foreach $f (@thefiles)
    {
        unless ( ($f eq ".") || ($f eq "..") )
        {
            $fr="$dirtoget$f";
            open(FILEREAD, "< $fr");

$x="";
while($line = <FILEREAD>) { $x .= $line; } # read the whole file into one string
close FILEREAD;

print "$x/n";   
while(<$x>) {
    chomp;
    s/\<[^<>]*\>//g;           # eliminate markup
    tr/[A-Z]/[a-z]/;           # downcase

    s/([a-z]+|[^a-z]+)/\1 /g;  # separate letter strings from other types of sequences

    s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

    s/[0-9]+/\#/g;             # map numerical strings to #

    s/\s+/ /g;                 # these three lines clean up white space (so it's always exactly one space between words, no newlines
    s/^\s+//;
    s/\s+$/ /;


    print if(m/\S/);           # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline

}}

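2 answers:

Answer 0 (score: 1)

You don't really need to edit the original script to apply it to the contents of a directory. In this case, the shell will be your friend.

Your first script reads every file passed as an argument, or the contents of stdin by default. In other words, you can call the original script like this: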

$ ./script file > output
$ cat file | ./script | less
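The behaviour described above comes from Perl's `<>` operator, which reads each file named in `@ARGV` in turn and falls back to stdin when `@ARGV` is empty. The following is a minimal sketch, not part of the original script: `joined_per_file` is a hypothetical helper that shows how `$ARGV` (the name of the file currently being read) and `eof` (end of the current file) can be used when several files are passed at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one "name: words" string per input file.
sub joined_per_file {
    local @ARGV = @_;    # <> reads the files listed in @ARGV, in order
    my ( @results, @words );
    while ( my $line = <> ) {
        chomp $line;
        push @words, $line;
        if (eof) {       # end of the current file, not of all input
            push @results, $ARGV . ': ' . join( ' ', @words );
            @words = ();
        }
    }
    return @results;
}

# Only run when file names were given, so the sketch never waits on stdin.
if (@ARGV) {
    print "$_\n" for joined_per_file(@ARGV);
}
```

Note that `eof` without parentheses tests the file currently being read, while `eof()` with empty parentheses is true only at the end of the last file.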

If you want to process all of the files, you can still use the shell:

$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"

This short example may make things clearer.

Consider a similar script named script:

#!/usr/bin/perl 
while(<>) {
   chomp;
   print ">$_<\n";
}
print "\n";

Now, from your shell, you can do:

$ mkdir foo && cd foo
$ echo -e "Hello\nYou\nI am A" >> a.txt
$ echo -e "Hello\nYou\nI am A" >> b.txt

$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"

$ ls 
a.txt  a.txt.out  b.txt  b.txt.out  script  script.out
$ cat a.txt.out
>Hello<
>You<
>I am A<

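Answer 1 (score: 1)

Your main problem is that you open each file and read its contents into $x, and then pass $x to the original loop as if it were a file handle. But it isn't a file handle; it's just plain text. If you had simply left out the step that reads the file, your code would have been close to working.

I think the following will do what you asked. It uses glob in preference to opendir / readdir because it is more concise.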
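To illustrate the difference between a string and a file handle: if you really do want to loop over text that is already in memory, Perl can open a read handle on a reference to the string. This is only a sketch of that alternative, not what the question's code does, and `lines_from_string` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: read a string line by line through an
# in-memory file handle (note the reference \$text passed to open).
sub lines_from_string {
    my ($text) = @_;
    open my $fh, '<', \$text
        or die "Cannot open in-memory handle: $!";
    my @lines;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

my @lines = lines_from_string("Hello\nYou\nI am A\n");
print scalar(@lines), " lines\n";    # prints "3 lines"
```

For this task it is simpler to skip the slurping step and read from the real file handle directly.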

#!/usr/local/bin/perl

use strict;
use warnings;

while ( my $file = glob '1999_txt/*' ) {

    next unless -f $file;

    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};

    while ( <$fh> ) {
        chomp;

        s/<[^<>]*>//g;             # Remove HTML tags
        tr/A-Z/a-z/;               # downcase

        s/([a-z]+|[^a-z]+)/$1 /g;  # separate letter strings from other types of sequences

        s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

        s/[0-9]+/#/g;              # map numerical strings to #

        s/\s+/ /g;                 # these three lines clean up whitespace
        s/^\s+//;                  # so it's always exactly one space
        s/\s+$/ /;                 # between words, no newlines

        print if /\S/;             # print what's left if it's not just whitespace
    }

    print "\n"; # final newline, so whole doc is on one line that ends in newline
}
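To see what the chain of substitutions does to a single line, it can help to wrap it in a function and feed it a sample string. This is a sketch under assumptions: `tokenize_line` is a hypothetical helper, the sample input is made up, and the last substitution keeps one trailing space as the original script does, so that words from consecutive lines stay separated when printed without newlines.

```perl
use strict;
use warnings;

# Hypothetical helper applying the same steps to one line of text.
sub tokenize_line {
    my ($line) = @_;
    chomp $line;
    $line =~ s/<[^<>]*>//g;               # remove markup
    $line =~ tr/A-Z/a-z/;                 # downcase
    $line =~ s/([a-z]+|[^a-z]+)/$1 /g;    # add a space after each run
    $line =~ s/[^a-z0-9\$\% ]//g;         # keep letters, digits, $, %, space
    $line =~ s/[0-9]+/#/g;                # map digit runs to #
    $line =~ s/\s+/ /g;                   # collapse whitespace
    $line =~ s/^\s+//;                    # drop leading whitespace
    $line =~ s/\s+$/ /;                   # leave exactly one trailing space
    return $line;
}

# The trailing "|" makes the trailing space visible.
print tokenize_line("<b>Hello</b>, World 42%\n"), "|\n";
```

For the sample input this prints `hello world #% |`: the tags and the comma are removed, the text is lowercased, and `42` becomes `#`.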