在XML文件中搜索并添加单词的文本文件

时间:2017-05-13 10:24:09

标签: perl

我有两个文件。第一行中每一行都有一系列单词

bus do car
car tree

第二个文件是XML文件

<title>i have a car. i take bus..</title>

我想在文本文件中搜索XML文件中的每个单词。如果找到,我想插入文本文件中出现的所有行,其中任何空格都替换为x

结果文件是

<title>i have a car busxdoxcar carxtree. i take bus busxdoxcar..</title>

我试试这个

use strict;
use warnings;
use autodie; 

my $QueryFile = "query.txt";
my $SequenceFile = "Seq_2_terms_150.txt";
my %hashlist;

open NewQueryFile ,">./NewQuery.txt"
    or die "Cannot create NewQuery.txt";

open(my $fh,$SequenceFile)
    or die "$SequenceFile : $!";

while ( <$fh> ) {
    chop;
    s/^\s+|\s+$//g;
    my $h = \%hashlist;
    foreach ( split('\s+', $_) ) {
        $h->{$_} //= {};        
        $h = $h->{$_}; 
    }
    $h->{'#'} = 1;
}
close $fh;

open(my $fd, $QueryFile)
    or die "$QueryFile : $!";

for my $xml (<$fd>) {  
    foreach my $line (split(/\n/, $xml)) {
        my @words = split(/\s/, $line);
        if $words = @hashlist[$_] {
            print NewQueryFile join ('x',$words) ;
        }
    }
}

close NewQueryFile ;

close($fd);

1 个答案:

答案 0 :(得分:0)

我已经整理了一个快速脚本,以表明人们可能会如何解决这个问题。

我没有对xml感到困扰,因为这可能让我心情不好。

我的建议是:使用变量,无论你从不这样做而节省了什么都会丢失,因为你的代码会让人感到困惑,然后就会出错。

#!/usr/bin/env perl

use strict;
use warnings;

# Notes:

# - more than one space or tab in a row are mangled: They become one space only
# - the query file is not checked for containing actual words to match on,
#   it is assumed to be suitable
# - I have made no attempt to parse xml. You should use a parser for that.
#   Search Stack Overflow or Google or CPAN or all of those for examples.
# - The replace_xml_text function can be used on the text supplied by the
#   parser to get the desired output
# - a feeble attempt is made to deal with punctuation in replace_xml_text
# - This code is not really tested

my %query_words;

my $query_fn = 'query.txt';
open (my $fh, "<",$query_fn) or die "could not open file '$query_fn'"; 

# build list of words from query file

while ( <$fh> ){
    chomp;

    # Words mentioned in line.
    my @words = split(/\s+/,$_);

    # Words joined by 'x'. Seems a strange choice *shrug*.
    # This is used to replace words with later.
    my $line  = join("x",@words);

    # Storing in arrayref, this seems easier to me 
    # than concatening repeatedly and getting the spaces right.
    for my $word ( @words ){
        push @{$query_words{$word}}, $line;
    }
}



# Expects the text to replace.
# Returns the text to replace it with.

sub replace_xml_text {
    my $original_text = shift;

    my @words;

    for my $word ( split(/\s+/,$original_text) ){

        my $punctuation = '';

        # Remove punctuation before matching,
        # but do preserve it.
        if ( $word =~s /(\s*[,.]\s*)$// ){
            $punctuation = $1;
        }

        if ( my $additions = $query_words{$word} ){
            $word = join(" ",$word,@$additions);
        }

        # Put punctuation back.
        $word .= $punctuation;

        # Done replacing in this word, next
        push @words,$word;
    }

    return join(" ",@words);
}