Question

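I have a huge XML file (about 10 GB) that I need to convert to CSV. The file contains information about many customers. The problem is that many customers have extra fields that other customers do not use, and some fields are repeated. An example of the XML: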

<customer>
    <customerID>1</customerID>
    <auc>
        <algoId>0</algoId>
        <kdbId>1</kdbId>
        <acsub>1</acsub>
    </auc>
</customer>

<customer>
    <customerID>2</customerID>
    <auc>
        <algoId>0</algoId>
        <kdbId>1</kdbId>
        <acsub>1</acsub>
        <extraBit>12345</extraBit>
    </auc>
    <auc>
        <algoId>2</algoId>
        <kdbId>3</kdbId>
        <acsub>3</acsub>
        <extraBit>67890</extraBit>
    </auc>
    <customOptions>
        <odboc>0</odboc>
        <odbic>0</odbic>
        <odbr>1</odbr>
        <odboprc>0</odboprc>
        <odbssm>0</odbssm>
    </customOptions>
</customer>

As you can see, the first customer has only one auc block, but the second has two, and its auc blocks also carry an extra tag, extraBit. The problems are:

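I need to process one customer at a time (from <customer> to </customer>, then the next one, and so on), because loading all 10 GB at once would crash the system.
I tried using XML::Twig in a loop, but when I try to read extraBit for customer 1 the program dies with an "undefined value" error: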

print $customer->first_child('extraBit')->text()

Can't call method "text" on an undefined value at xml-tags.pl line 50.
For a customer's additional auc values, I want them output in the CSV file like this:

customerID,algoId,kdbId,acsub,extraBit,algoId2,kdbId2,acsub2,extraBit2

1,0,1,1,,,,,

2,0,1,1,12345,2,3,3,67890

Answer 1

print $customer->first_child('extraBit')->text()

You can use first_child_text to avoid the undefined-value error; it is defined to return an empty string when no matching child element is found.

print $customer->first_child_text('extraBit')
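
To make the difference concrete, here is a minimal sketch that compares the two calls on a tiny in-memory document. The <root> wrapper and the variable names are made up for the demo. Note that in the sample XML extraBit sits inside auc, so the lookup is done on an auc element rather than on the customer.

```perl
use strict;
use warnings;
use XML::Twig;

# Parse a small in-memory document; the <root> wrapper is invented for this demo.
my $twig = XML::Twig->new;
$twig->parse('<root><auc><algoId>0</algoId></auc></root>');
my $auc = $twig->root->first_child('auc');

# first_child finds no <extraBit> here, so chaining ->text() onto it would die.
my $missing = $auc->first_child('extraBit');
print defined $missing ? "first_child: found\n" : "first_child: undef\n";

# first_child_text gives back an empty string in the same situation.
my $text = $auc->first_child_text('extraBit');
print "first_child_text: '$text'\n";
```

The second call can be chained or printed directly without a defined check, which is what makes it convenient for optional fields.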

The complete code would look something like this:

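use strict;
use warnings;
use XML::Twig;

# Call process_customer each time a complete <customer> element has been parsed.
my $t = XML::Twig->new(
  twig_handlers => { customer => \&process_customer });
$t->parsefile('file.xml');

sub process_customer {
  my ($t, $customer) = @_;
  print $customer->first_child_text('customerID');
  # One group of four columns per <auc> block; missing children print as ''.
  foreach my $auc ($customer->children('auc')) {
    print ',', $auc->first_child_text('algoId'),
          ',', $auc->first_child_text('kdbId'),
          ',', $auc->first_child_text('acsub'),
          ',', $auc->first_child_text('extraBit');
  }
  print "\n";
  $customer->purge;  # free the memory used by the elements parsed so far
}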
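
The loop above prints only as many column groups as a customer has auc blocks, so rows end up with different widths. If you want every row to have the same columns as in the question's expected output, one possible variation is to pad up to a fixed number of auc blocks. This is only a sketch: $MAX_AUC, @FIELDS and customer_row are names invented here, and it assumes you know the maximum number of auc blocks in advance and that the field values contain no commas or quotes.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumption: no customer has more than this many <auc> blocks.
my $MAX_AUC = 2;
my @FIELDS  = qw(algoId kdbId acsub extraBit);

# Build one CSV line for a <customer> element, padding absent auc blocks
# with empty fields so every row has the same number of columns.
sub customer_row {
    my ($customer) = @_;
    my @aucs = $customer->children('auc');
    my @row  = ($customer->first_child_text('customerID'));
    for my $i (0 .. $MAX_AUC - 1) {
        my $auc = $aucs[$i];
        push @row, map { defined $auc ? $auc->first_child_text($_) : '' } @FIELDS;
    }
    return join ',', @row;
}

my $twig = XML::Twig->new(
    twig_handlers => {
        customer => sub {
            my ($t, $customer) = @_;
            print customer_row($customer), "\n";
            $t->purge;    # release the memory of everything parsed so far
        },
    },
);

# Header row: the first block is unsuffixed, later blocks get 2, 3, ...
my @header = ('customerID');
for my $n (1 .. $MAX_AUC) {
    push @header, map { $n == 1 ? $_ : $_ . $n } @FIELDS;
}
print join(',', @header), "\n";

# Pass the XML file name on the command line.
$twig->parsefile($ARGV[0]) if @ARGV;
```

With $MAX_AUC set to 2, a customer that has a single auc block and no extraBit gets five trailing empty fields, giving nine columns in every row. If values may contain commas or quotes, a CSV module such as Text::CSV would be a safer way to write the rows than join.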

Perl XML::Twig. Huge file processing. How to handle repeated entries and nonexistent ones
