Extracting data from an XML file using XML::LibXML

Time: 2017-05-31 11:50:44

Tags: xml perl xml-libxml

I have an XML file containing thousands of entries like these:

<mediawiki>
  <page>
    <title>page1</title>
    <revision>
      <id>2621</id>
      <parentid>6</parentid>
      <timestamp>2005-10-09T01:00:18Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text1</text>
    </revision>
  </page>
  <page>
    <title>page2</title>
    <ns>8</ns>
    <id>7</id>
    <revision>
      <id>2619</id>
      <parentid>2618</parentid>
      <timestamp>2005-10-09T00:56:39Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text2</text>
    </revision>
  </page>
  <page>
    <title>page3</title>
    <ns>8</ns>
    <id>6</id>
    <revision>
      <id>2621</id>
      <parentid>6</parentid>
      <timestamp>2005-10-09T01:00:18Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text3</text>
    </revision>
  </page>
</mediawiki>

My script must write each page to its own text file, whose name is the content of the <title> tag and whose content is the text of the <text xml:space="preserve"></text> element.

My code:

my $filename = "pages.xml";
my $parser   = XML::LibXML->new();
my $xmldoc   = $parser->parse_file( $filename );
my $file;

foreach my $page ( $xmldoc->findnodes( '/mediawiki/page' ) ) {

    foreach my $title ( $page->findnodes( '/mediawiki/page/title' ) ) {

        foreach my $rev ( $page->findnodes( '/mediawiki/page/revision' ) ) {

            foreach my $text ( $rev->findnodes( 'text/text()' ) ) {

                $file = $title->to_literal();
                my $newfile = "$file.txt";

                open( my $out, '>:utf8', $newfile )
                        or die "Unable to open '$newfile' for write: $!";
                my $texte = $text->data;
                print $out "$text\n";
                close $out;
            }
        }
    }
}

The problem is that every file produced contains the same text: that of the last <text xml:space="preserve"></text> element.

1 answer:

Answer 0: (score: 1)

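Your mistake is nesting all the for loops without using relative XPath expressions. A path that begins with / is evaluated from the document root regardless of the node it is called on, so each inner loop visits every <title> and every <revision> in the whole document. Every file is therefore rewritten once per revision, and the last revision's text wins.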

This should do what you want:

use utf8;
use strict;
use warnings 'all';
use feature 'say';

STDOUT->autoflush;

use XML::LibXML;

my $filename = "pages.xml";
my $doc      = XML::LibXML->load_xml( location => $filename );

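# Absolute path from the document root: one node per <page>
for my $page ( $doc->findnodes('/mediawiki/page') ) {

    # Relative path: only the <title> child of the current page
    my ($title) = $page->findnodes('title');
    my $file = $title->textContent;

    # Relative path: only the <text> inside the current page's <revision>
    my ($rev_text) = $page->findnodes('revision/text');
    my $text = $rev_text->textContent;

    open my $fh, '>:utf8', $file
        or die qq{Unable to open "$file" for output: $!};

    print $fh "$text\n";

    close $fh;

    say qq{File "$file" written with "$text"};
}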

Output:

File "page1" written with "text1"
File "page2" written with "text2"
File "page3" written with "text3"
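
The difference between an absolute and a relative XPath expression can be checked on a small inline sample. The following is a minimal sketch rather than part of the original answer: the two-page XML string is made up for the demonstration, and `$first_page`, `@absolute`, and `@relative` are illustrative names.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical two-page sample in the same shape as the question's file
my $xml = <<'XML';
<mediawiki>
  <page><title>page1</title><revision><text>text1</text></revision></page>
  <page><title>page2</title><revision><text>text2</text></revision></page>
</mediawiki>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Take the first <page> element as the context node
my ($first_page) = $doc->findnodes('/mediawiki/page');

# Absolute path: starts at the document root, so the context node is ignored
my @absolute = map { $_->textContent }
    $first_page->findnodes('/mediawiki/page/title');

# Relative path: evaluated against $first_page only
my @relative = map { $_->textContent }
    $first_page->findnodes('title');

print "absolute: @absolute\n";    # absolute: page1 page2
print "relative: @relative\n";    # relative: page1
```

Called on the same page node, the absolute path returns the titles of all pages, while the relative path returns only that page's own title. This is why the nested loops in the question wrote every revision's text into every file.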