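I'm a Perl hobbyist. I was recently given a Perl script that takes a text file and strips all formatting, leaving only single words, each followed by a space. The problem is that the script doesn't make it clear how to supply the file location. I've set up some code to run it over a whole directory of files, but I haven't been able to get it to execute. I'll post the original code, followed by what I added. Thanks for your help!
Original: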
while(<>) {
    chomp;
    s/\<[^<>]*\>//g; # eliminate markup
    tr/[A-Z]/[a-z]/; # downcase
    s/([a-z]+|[^a-z]+)/\1 /g; # separate letter strings from other types of sequences
    s/[^a-z0-9\$\% ]//g; # delete anything not a letter, digit, $, or %
    s/[0-9]+/\#/g; # map numerical strings to #
    s/\s+/ /g; # these three lines clean up white space (so it's always exactly one space between words, no newlines
    s/^\s+//;
    s/\s+$/ /;
    print if(m/\S/); # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline
My changes:
#!/usr/local/bin/perl
$dirtoget="1999_txt/";
opendir(IMD, $dirtoget) || die("Cannot open directory");
@thefiles= readdir(IMD); #
closedir(IMD);
foreach $f (@thefiles)
{
    unless ( ($f eq ".") || ($f eq "..") )
    {
        $fr="$dirtoget$f";
        open(FILEREAD, "< $fr");
        $x="";
        while($line = <FILEREAD>) { $x .= $line; } # read the whole file into one string
        close FILEREAD;
        print "$x/n";
        while(<$x>) {
            chomp;
            s/\<[^<>]*\>//g; # eliminate markup
            tr/[A-Z]/[a-z]/; # downcase
            s/([a-z]+|[^a-z]+)/\1 /g; # separate letter strings from other types of sequences
            s/[^a-z0-9\$\% ]//g; # delete anything not a letter, digit, $, or %
            s/[0-9]+/\#/g; # map numerical strings to #
            s/\s+/ /g; # these three lines clean up white space (so it's always exactly one space between words, no newlines
            s/^\s+//;
            s/\s+$/ /;
            print if(m/\S/); # print what's left
        }
        print "\n"; # final newline, so whole doc is on one line that ends in newline
    }
}
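Answer 0 (score: 1)
You don't really need to edit the original script to apply it to the contents of a directory. In this case, the shell is your friend.
Your first script reads every file passed as an argument, or stdin by default. In other words, you can invoke the original script like this: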
$ ./script file > output
$ cat file | ./script | less
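To see why both invocations work, here is a minimal sketch of how the `<>` operator picks its input: it reads the files named in `@ARGV` one after another, and falls back to stdin only when `@ARGV` is empty. The helper name `lines_from_args` and the temporary files are assumptions made for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: collect every line that <> yields for a given argument list.
sub lines_from_args {
    my @args = @_;
    local @ARGV = @args;    # <> consumes @ARGV, so work on a localized copy
    my @lines;
    while ( my $line = <> ) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# Two small temporary files stand in for real input files.
my ( $fh_a, $file_a ) = tempfile( UNLINK => 1 );
print {$fh_a} "Hello\nYou\n";
close $fh_a;

my ( $fh_b, $file_b ) = tempfile( UNLINK => 1 );
print {$fh_b} "I am A\n";
close $fh_b;

# Both files are read in order, as if they had been named on the command line.
print "$_\n" for lines_from_args( $file_a, $file_b );
```

Because `@ARGV` is localized inside the helper, the caller's own argument list is left untouched.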
If you want to process all the files, you can still use the shell:
$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"
This short example may make it clearer. Consider a similar script named script:
#!/usr/bin/perl
while(<>) {
    chomp;
    print ">$_<\n";
}
print "\n";
Now, from your shell, you can do:
$ mkdir foo && cd foo
$ echo -e "Hello\nYou\nI am A" >> a.txt
$ echo -e "Hello\nYou\nI am A" >> b.txt
$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"
$ ls
a.txt a.txt.out b.txt b.txt.out script script.out
$ cat a.txt.out
>Hello<
>You<
>I am A<
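Answer 1 (score: 1)
Your main problem is that you open each file and read its contents into $x, then pass $x to the original loop as if it were a filehandle. But it isn't a filehandle; it's just plain text. If you had simply left out the step that reads the file into a string and looped over the filehandle instead, your code would have been close to working.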
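To illustrate the difference between a string and a filehandle, here is a minimal sketch. If you really do want to iterate over text that is already in memory, Perl (5.8 and later) lets you open a read handle on a reference to the string. The helper name `lines_of_string` is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: read a string line by line via an in-memory filehandle.
sub lines_of_string {
    my ($text) = @_;
    # Opening a reference to a scalar gives a handle that reads from the string.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @lines;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

my $x = "first line\nsecond line\n";
print "[$_]\n" for lines_of_string($x);
```

For the task in the question this detour is unnecessary, since the file's own handle can be read directly.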
I think this will do what you asked. It uses glob in preference to opendir / readdir because it is more concise.
#!/usr/local/bin/perl
use strict;
use warnings;

while ( my $file = glob '1999_txt/*' ) {
    next unless -f $file;    # skip directories and other non-files
    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
    while ( <$fh> ) {
        chomp;
        s/<[^<>]*>//g;               # Remove HTML tags
        tr/A-Z/a-z/;                 # downcase
        s/([a-z]+|[^a-z]+)/$1 /g;    # separate letter strings from other types of sequences
        s/[^a-z0-9\$\% ]//g;         # delete anything not a letter, digit, $, or %
        s/[0-9]+/#/g;                # map numerical strings to #
        s/\s+/ /g;                   # these three lines clean up whitespace
        s/^\s+//;                    # so it's always exactly one space between words, no newlines
        s/\s+$/ /;                   # keep one trailing space (as in the original) so lines don't run together
        print if /\S/;               # print what's left if it's not just whitespace
    }
    print "\n";                      # final newline, so whole doc is on one line that ends in newline
}
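As a worked example of what the cleanup steps do to a single line, the sketch below wraps them in a function and applies them to a made-up sample string. The name `clean_line` and the sample input are assumptions for illustration only. The last step keeps one trailing space, as the script in the question does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): apply the cleanup steps to one line.
sub clean_line {
    my ($line) = @_;
    chomp $line;
    $line =~ s/<[^<>]*>//g;               # remove markup
    $line =~ tr/A-Z/a-z/;                 # downcase
    $line =~ s/([a-z]+|[^a-z]+)/$1 /g;    # separate letter runs from everything else
    $line =~ s/[^a-z0-9\$\% ]//g;         # keep only letters, digits, $, % and spaces
    $line =~ s/[0-9]+/#/g;                # map digit runs to #
    $line =~ s/\s+/ /g;                   # collapse runs of whitespace
    $line =~ s/^\s+//;                    # trim leading whitespace
    $line =~ s/\s+$/ /;                   # end with exactly one space
    return $line;
}

my $sample = "<b>Hello</b> World, 42% off!\n";
print clean_line($sample), "\n";    # prints "hello world #% off " followed by a newline
```

The tags and punctuation are dropped, the number becomes `#`, and each surviving token ends up followed by a single space.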