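Looking for help:
I have a directory full of text files named with numeric IDs. Each text file contains the body of a news article. Some articles are split into several parts, so they are spread across multiple text files.
The names look like this:
1001_1.txt, 1001_2.txt (these two files contain two different parts of the same article), 1002_1.txt, 1003_1.txt, 1004_1.txt, 1004_2.txt, 1004_3.txt, 1004_4.txt (these four files contain four different parts of the same article; there are never more than 4 parts).
And so on.
Basically, I need a script (PHP, Perl, Ruby or anything else) that simply puts the name of the text file (the part before the underscore) in one column and the content of the text file in another column, and, if there is a number after the underscore, puts that in a column too.
So you would have a table structure like this: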
1001 | 1 | content of the text file
1001 | 2 | content of the text file
1002 | 1 | content of the text file
1003 | 1 | content of the text file
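Any help on how to achieve this would be greatly appreciated.
About 7000 text files need to be read and imported into a table for future use in a database.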
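The one-row-per-file layout above could be sketched roughly as follows. This is only a minimal sketch using core Perl: the helper names `parse_name` and `long_row`, the default directory, and the decision to collapse line breaks into spaces are all assumptions, not something the question specifies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a name such as "1001_2.txt" into (ID, part). Returns an empty
# list for names that do not follow the ID_PART.txt pattern.
sub parse_name {
    my ($name) = @_;
    return $name =~ /\A(\d+)_([1-4])\.txt\z/ ? ($1, $2) : ();
}

# Build one "ID | part | content" row. Whitespace runs, including line
# breaks, are collapsed so that each row stays on a single line
# (an assumption about how the content should be stored).
sub long_row {
    my ($id, $part, $text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return join ' | ', $id, $part, $text;
}

# Hypothetical driver: the directory comes from the first argument and
# defaults to the current directory.
my $dir = @ARGV ? $ARGV[0] : '.';
opendir(my $dh, $dir) or die "Cannot open $dir: $!\n";
my @names = readdir($dh);
closedir($dh);

for my $name (sort @names) {
    my ($id, $part) = parse_name($name) or next;   # skip other files
    open(my $fh, '<', "$dir/$name") or die "Cannot read $dir/$name: $!\n";
    my $text = do { local $/; <$fh> };             # slurp the whole file
    close($fh);
    print long_row($id, $part, defined $text ? $text : ''), "\n";
}
```

If the article text can itself contain ` | `, a different separator (a tab, for example) may be safer for the later database import.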
It would be even better if the contents of the _1 and _2 files could be separated into different columns, for example:
1001 | 1 | content | 2 | content | 3 | content | 4 | content
1002 | 1 | content
1003 | 1 | content
(As I said, the file names go up to _4 at most, so you can have 1001_1, 1001_2, 1001_3, 1001_4.txt, or just 1002_1 and 1003_1.txt.)
Answer 0 (score: 2)
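#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use File::Slurp;    # CPAN module, provides read_file

die "Need somewhere to start\n" unless @ARGV;

my %files;    # article ID => { part number => content }

# no_chdir keeps $File::Find::name usable when a relative start directory is given
find({ wanted => \&wanted, no_chdir => 1 }, @ARGV);

for my $name (sort keys %files) {
    my $file = $files{$name};
    # emit the ID once, then a (part, content) pair for each part that exists
    print join( ' | ', $name,
        map { exists $file->{$_} ? ($_, $file->{$_}) : () } 1 .. 4
    ), "\n";
}

sub wanted {
    my $file = $File::Find::name;
    return unless -f $file;
    return unless $file =~ /([0-9]{4})_([1-4])\.txt$/;
    # I do not know what you want to do with newlines;
    # the single-quoted '\n' joins the lines with a literal backslash-n
    $files{$1}->{$2} = join('\n', map { chomp; $_ } read_file $file);
    return;
}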
Output:
1001 | 1 | lsdkjv\nsdfljk\nsdklfjlksjadf\nlsdjflkjdsf | 3 | sadlfkjldskfj
1002 | 1 | ldskfjsdlfjkl
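The output contains the two characters `\n` rather than real line breaks because the separator in `join('\n', ...)` is single-quoted. A small sketch of the difference, in case real newlines or plain spaces are preferred; the sample lines are made up for illustration.

```perl
use strict;
use warnings;

my @lines = ('first line', 'second line');    # made-up sample content

my $literal = join '\n', @lines;   # single quotes: a backslash followed by "n"
my $newline = join "\n", @lines;   # double quotes: a real line break
my $spaced  = join ' ',  @lines;   # one space, keeps a table row on one line

print "$literal\n";
print "$spaced\n";
```

Joining with a space or with the literal `\n` keeps each table row on one physical line, which tends to be easier to import line by line.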
Answer 1 (score: 1)
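use strict;
use warnings;

my %content;    # article ID => { part number => content }

# <> reads every file named on the command line, one line at a time
while (<>){
    s/\s+/ /g;    # collapse whitespace runs, including the newline, to one space
    # $ARGV holds the name of the file currently being read
    my ($f, $n) = $ARGV =~ /(\d+)_(\d)\.txt$/;
    next unless defined $f;    # skip files that are not named ID_PART.txt
    $content{$f}{$n} .= $_;
}

for my $f (sort keys %content){
    print join('|',
        $f,
        map { $_ => $content{$f}{$_} } sort keys %{$content{$f}},
    ), "\n";
}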
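The script above prints only the parts that exist, so rows have different numbers of fields. If a fixed number of columns is easier to import into a database table (an assumption, the question does not say), missing parts could be padded with empty strings. A minimal sketch; `padded_row` is a hypothetical helper and the sample texts are made up.

```perl
use strict;
use warnings;

# Build "ID|1|text|2|text|3|text|4|text", using an empty string for any
# part that has no file, so that every row has exactly nine fields.
sub padded_row {
    my ($id, $parts) = @_;    # $parts: hash ref, part number => text
    return join '|', $id,
        map { ($_, defined $parts->{$_} ? $parts->{$_} : '') } 1 .. 4;
}

print padded_row('1001', { 1 => 'intro', 3 => 'ending' }), "\n";
print padded_row('1002', { 1 => 'whole article' }), "\n";
```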
Answer 2 (score: 0)
Probably not optimal, but it could be a starting point for you (deliberately commented):
#!/usr/bin/perl
use strict;
use warnings;

# results hash
my %res = ();

# for each .txt file in the current directory
for (glob '*.txt') {
    s/\.txt$//;                 # strip the .txt suffix
    my $t = '';                 # buffer for the file contents
    my ($f, $n) = split '_';    # split the file name, e.g. 1001_1 => 1001 and 1

    # read the file contents
    {
        local $/;                                   # slurp mode
        open(my $F, '<', $_ . '.txt') || die $!;    # open the txt file for reading
        $t = <$F>;                                  # get contents
        close($F);                                  # close the text file
    }

    # turn each \r, \n and \t into a space
    $t =~ s/[\r\n\t]/ /g;

    # append, for example, "1001 | 2 | contents of 1001_2.txt | " to the entry for 1001
    $res{$f} .= "$f | $n | $t | ";
}

# print the results, sorted numerically by ID
for (sort { $a <=> $b } keys %res) {
    # remove the trailing ' | '
    $res{$_} =~ s/\s\|\s$//;
    # print
    print $res{$_} . "\n";
}

# happy ending
exit 0;
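Note that the script above puts the ID in front of every part, so an article with two parts prints as `1001 | 1 | ... | 1001 | 2 | ...`. To get the ID once per row, as in the question's second table, the parts could be collected in a list per ID and joined at the end. A minimal sketch; `add_part` and `format_rows` are hypothetical helpers, and the sample data is made up in place of the file-reading loop.

```perl
use strict;
use warnings;

# Record one part; $rows maps an article ID to a flat list of
# (part, text, part, text, ...), in the order the parts were added.
sub add_part {
    my ($rows, $id, $part, $text) = @_;
    push @{ $rows->{$id} }, $part, $text;
    return;
}

# Return one "ID | part | text | part | text" line per article,
# sorted numerically by ID.
sub format_rows {
    my ($rows) = @_;
    return map {
        my $id = $_;
        join ' | ', $id, @{ $rows->{$id} };
    } sort { $a <=> $b } keys %$rows;
}

my %rows;
add_part(\%rows, 1001, 1, 'first part');    # made-up sample data
add_part(\%rows, 1001, 2, 'second part');
add_part(\%rows, 1002, 1, 'only part');
print "$_\n" for format_rows(\%rows);
```

Because `glob` returns names in sorted order, calling `add_part` inside the file loop would add the parts of each article in ascending order.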