我有两个文件。第一行中每一行都有一系列单词
bus do car
car tree
第二个文件是XML文件
<title>i have a car. i take bus..</title>
我想在文本文件中搜索XML文件中的每个单词。如果找到,我想插入文本文件中出现的所有行,其中任何空格都替换为x
。
结果文件是
<title>i have a car busxdoxcar carxtree. i take bus busxdoxcar..</title>
我试试这个
use strict;
use warnings;
use autodie;
my $QueryFile = "query.txt";
my $SequenceFile = "Seq_2_terms_150.txt";
my %hashlist;
open NewQueryFile ,">./NewQuery.txt"
or die "Cannot create NewQuery.txt";
open(my $fh,$SequenceFile)
or die "$SequenceFile : $!";
while ( <$fh> ) {
chop;
s/^\s+|\s+$//g;
my $h = \%hashlist;
foreach ( split('\s+', $_) ) {
$h->{$_} //= {};
$h = $h->{$_};
}
$h->{'#'} = 1;
}
close $fh;
open(my $fd, $QueryFile)
or die "$QueryFile : $!";
for my $xml (<$fd>) {
foreach my $line (split(/\n/, $xml)) {
my @words = split(/\s/, $line);
if $words = @hashlist[$_] {
print NewQueryFile join ('x',$words) ;
}
}
}
close NewQueryFile ;
close($fd);
答案 0 :(得分:0)
我已经整理了一个快速脚本,以表明人们可能会如何解决这个问题。
我没有对xml感到困扰,因为这可能让我心情不好。
我的建议是:使用变量,无论你从不这样做而节省了什么都会丢失,因为你的代码会让人感到困惑,然后就会出错。
#!/usr/bin/env perl
use strict;
use warnings;
# Notes:
# - more than one space or tab in a row are mangled: They become one space only
# - the query file is not checked for containing actual words to match on,
# it is assumed to be suitable
# - I have made no attempt to parse xml. You should use a parser for that.
# Search Stack Overflow or Google or CPAN or all of those for examples.
# - The replace_xml_text function can be used on the text supplied by the
# parser to get the desired output
# - a feeble attempt is made to deal with punctuation in replace_xml_text
# - This code is not really tested
my %query_words;
my $query_fn = 'query.txt';
open (my $fh, "<",$query_fn) or die "could not open file '$query_fn'";
# build list of words from query file
while ( <$fh> ){
chomp;
# Words mentioned in line.
my @words = split(/\s+/,$_);
# Words joined by 'x'. Seems a strange choice *shrug*.
# This is used to replace words with later.
my $line = join("x",@words);
# Storing in arrayref, this seems easier to me
# than concatening repeatedly and getting the spaces right.
for my $word ( @words ){
push @{$query_words{$word}}, $line;
}
}
# Expects the text to replace.
# Returns the text to replace it with.
sub replace_xml_text {
my $original_text = shift;
my @words;
for my $word ( split(/\s+/,$original_text) ){
my $punctuation = '';
# Remove punctuation before matching,
# but do preserve it.
if ( $word =~s /(\s*[,.]\s*)$// ){
$punctuation = $1;
}
if ( my $additions = $query_words{$word} ){
$word = join(" ",$word,@$additions);
}
# Put punctuation back.
$word .= $punctuation;
# Done replacing in this word, next
push @words,$word;
}
return join(" ",@words);
}