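Trying to parse XML in Perl, but a long data string gets truncated

Time: 2011-06-06 10:30:50

Tags: xml perl parsing

I tried parsing an XML file with both XML::Simple and XML::Twig, and got the same result with each. The other fields in the file work fine.

The file in question can be retrieved with: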

curl -s "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130"

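Is this a problem with the parser or with the file? The output is the same with both parsers. The string contains HTML markup stored inside the XML.

The input field (inside the XML tags named 'summary'):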

<summary type="html">&lt;p&gt;Toxoplasmosis is a disease caused by the parasite &lt;em&gt;Toxoplasma gondii&lt;/em&gt;. More than 60 million people in the U.S. have the parasite.  Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak immune systems and babies whose mothers become infected for the first time during pregnancy. Problems can include damage to the brain, eyes and other organs.&lt;/p&gt;&#xd;^I&#xd;&lt;p&gt;You can get toxoplasmosis from &lt;/p&gt;&#xd;&lt;ul&gt;&#xd;&lt;li&gt;^IWaste from an infected cat&lt;/li&gt;&#xd;&lt;li&gt;^IEating contaminated meat that is raw or not well cooked &lt;/li&gt;&#xd;&lt;li&gt;^IUsing utensils or cutting boards after they've had contact with raw meat &lt;/li&gt;&#xd;&lt;li&gt;^IDrinking infected water &lt;/li&gt;&#xd;&lt;li&gt;^IReceiving an infected organ transplant or blood transfusion&lt;/li&gt;&#xd;&lt;/ul&gt;&#xd;&lt;p&gt;Most people with toxoplasmosis don't need treatment. There are drugs to treat it for pregnant women and people with weak immune systems. &lt;/p&gt;&#xd;&#xd;&lt;p class="NLMattribution"&gt;Centers for Disease Control and Prevention&lt;/p&gt;</summary>

Output after parsing the XML:

<p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite.  Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak im<p class="NLMattribution">Centers for Disease Control and Prevention</p>to treat it for pregnant women and people with weak immune systems. </p>her organs.</p>

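Solution: the XML file contains carriage returns ("&#xd;"), which cause problems for the parser. After downloading the XML file, I removed the carriage returns with the following line: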

sed -i 's/&#xd;//g' *.xml

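The parser now works as expected.

Update: the carriage returns do not affect the parser itself; they only make the printed output look truncated and jumbled. Removing them did solve my problem, though.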
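The effect described in the update can be reproduced without any XML module. The following is a minimal sketch using core Perl only; the sample string and the helper name `normalize_cr` are made up for illustration, and it assumes each `&#xd;` reaches the program as a literal "\r" once the XML has been parsed. Converting the carriage returns after parsing, instead of editing the file with sed, keeps the downloaded XML untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the text a parser returns for <summary>:
# every &#xd; in the XML has become a literal carriage return ("\r").
my $summary = "<p>First paragraph.</p>\r\t\r<p>Second paragraph.</p>\r<p>Last.</p>";

# Hypothetical helper: replace carriage returns (alone or followed by a
# newline) with a plain newline, so a terminal no longer jumps back to
# column 0 and overwrites text that was already printed.
sub normalize_cr {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

my $clean = normalize_cr($summary);

# The string was never truncated: compare the lengths before printing.
printf "length before: %d, after: %d\n", length $summary, length $clean;
print $clean, "\n";
```

With XML::Twig, the same helper could be applied to the value returned by `$summary->text` before printing it.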

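1 Answer:

Answer 0: (score: 2)

When parsing the output of curl through a pipe (using XML::Twig->new->parse( curl -s "http://..." |)), I do get some odd results: the content appears truncated, and it changes from one call to the next...

If I parse a file created from the curl output, or use XML::Twig's native parseurl method, things look better: the result is consistent, and it is what you want: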

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

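# Let XML::Twig fetch the document itself instead of reading from a curl pipe.
my $url  = "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130";
my $twig = XML::Twig->new->parseurl($url);

# The first <summary> element holds the HTML text.
my $summary = $twig->first_elt('summary');

print $summary->text, "\n";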

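Honestly, I don't know why this happens. I'll try to look into it a bit more, but I doubt there is much I can do: if the problem shows up with both XML::Simple and XML::Twig, it probably sits at a lower level of the stack, in XML::Parser or expat and the way they interact with curl.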
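One way to check whether the text is really cut off, or only displayed oddly by the terminal, is to print it with the control characters made visible. This does not explain the variation between calls mentioned above; it is only a diagnostic sketch in core Perl, and both the sample string and the helper name `visible` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: show carriage returns, tabs and newlines as
# backslash escapes so they cannot move the cursor when printed.
sub visible {
    my ($text) = @_;
    my %names = ( "\r" => '\r', "\t" => '\t', "\n" => '\n' );
    $text =~ s/([\r\t\n])/$names{$1}/g;
    return $text;
}

# Made-up sample standing in for the parsed <summary> text.
my $text = "<p>weak immune systems.</p>\r\t\r<p>You can get it from</p>\r";

print visible($text), "\n";
printf "carriage returns found: %d\n", scalar( () = $text =~ /\r/g );
```

If the escaped output shows the complete text, the data is intact and only the display is affected.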