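I'm using XMLReader with PHP to process medium-sized XML files (6 MB), basically breaking out the attribute data and inserting it into my own database. The problem is that each element has a variable number of child elements, and those children all use the same attribute names.
Here is an example (this is public data about the government, courtesy of govtrack.us):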
<?xml version="1.0" ?>
<people>
<person id='400001' lastname='Abercrombie' firstname='Neil' birthday='1938-06-26' gender='M' pvsid='26827' osid='N00007665' bioguideid='A000014' metavidid='Neil_Abercrombie' youtubeid='hawaiirep1' name='Rep. Neil Abercrombie [D, HI-1]' title='Rep.' state='HI' district='1' >
<role type='rep' startdate='1985-01-03' enddate='1986-10-18' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1991-01-03' enddate='1992-10-09' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1993-01-05' enddate='1994-12-01' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1995-01-04' enddate='1996-10-04' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1997-01-07' enddate='1998-12-19' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1999-01-06' enddate='2000-12-15' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='2001-01-03' enddate='2002-11-22' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='2003-01-07' enddate='2004-12-09' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
<role type='rep' startdate='2005-01-04' enddate='2006-12-08' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
<role type='rep' startdate='2007-01-04' enddate='2009-01-03' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
<role type='rep' startdate='2009-01-06' enddate='2010-03-01' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
</person>
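I don't need to run any fancy logic on the attributes. At the start of my script I check whether I have already processed this particular record (based on the 'id' attribute), and then I grab nearly every attribute and parse it into my database. But there are two problems:
1) When I use this: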
$p->getAttribute('id')
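to get the 'id', it gives it to me twice, separated by as many line breaks as there are child elements in the element (I think a comment on this page explains this, but I'm not sure what to do about it).
2) How do I access the attributes of each child element in order? This: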
$p->getAttribute('startdate')
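gives me every 'startdate' value, separated by multiple line breaks. I just need to get the element's id and then iterate over each 'role' child element.
Any ideas?
For reference, here is my super-simple controller so far: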
$f = base_url().'data/people.xml';
$p = new XMLReader;
$p->open($f);
while($p->read())
{
if($this->_notImported('govtrack',$p->getAttribute('id')))
{
// here I just grab the attributes, put them into arrays to insert, like so:
$insert = array('indiv_name' => $full_name,
'indiv_first' => ($p->getAttribute(‘firstname’)),
'indiv_last' => ($p->getAttribute(‘lastname’)),
'indiv_middle' => ($p->getAttribute(‘middlename’)),
'indiv_other' => ($p->getAttribute(‘namemod’)),
'indiv_full_name' => $full_name,
'indiv_title' => ($p->getAttribute(‘title’)),
'indiv_dob' => ($p->getAttribute(‘birthday’)),
'indiv_gender' => ($p->getAttribute(‘gender’)),
'indiv_religion' => ($p->getAttribute(‘religion’)),
'indiv_url' => ($url)
);
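For the <person> element itself this isn't hard, but I can't figure out how to loop through each 'role' child element and get its attributes separately.
Answer 0 (score: 3)
Your first problem is that you are not checking the nodeType, which is indeed what the comment you linked is about: without that check, your code matches both the opening tag (ELEMENT) and the closing tag (END_ELEMENT).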
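Your second problem also comes from the missing nodeType check. Once that is fixed, you only need to check the node's name to tell whether it is a <role> or a <person>.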
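To make the nodeType point concrete, here is a minimal sketch that just lists what read() visits. The tiny XML string is a cut-down stand-in for the govtrack data, and the $visits array is only there for illustration. I'm assuming the whitespace between tags is reported as its own node, which is where the blank lines in your output would come from.

```php
<?php
// Cut-down stand-in for the real file; a real script would call $reader->open($path).
$xml = "<people>\n<person id='400001'>\n<role startdate='1985-01-03' />\n</person>\n</people>";

$reader = new XMLReader();
$reader->XML($xml);

// Record every node the reader stops on: [nodeType, name].
$visits = [];
while ($reader->read()) {
    $visits[] = [$reader->nodeType, $reader->name];

    if ($reader->nodeType === XMLReader::ELEMENT) {
        echo "ELEMENT      {$reader->name}\n";
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
        echo "END_ELEMENT  {$reader->name}\n";
    } else {
        // Mostly the whitespace between tags.
        echo "other node   (type {$reader->nodeType})\n";
    }
}
$reader->close();
```

Note that <person> shows up twice (once opening, once closing), while the self-closing <role /> only shows up once.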
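Since I assume you are also reading large XML files, you probably want to know when the reader moves on to the next person tag (via the END_ELEMENT nodeType). See my example below: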
while ($p->read()) {
    // check for nodeType here (opening tag only)
    if ($p->nodeType == XMLReader::ELEMENT) {
        if ($p->name == 'person') {
            if ($this->_notImported('govtrack', $p->getAttribute('id'))) {
                // $insert['indiv_*'] stuff here
            } else {
                $insert = null; // skip record because it's already imported
            }
        } else if ($p->name == 'role') {
            // role stuff here
            $startdate = $p->getAttribute('startdate');
        }
    // check for closing </person> tag here
    } else if ($p->nodeType == XMLReader::END_ELEMENT && $p->name == 'person') {
        if (isset($insert)) {
            // db insert here
        }
    }
}
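If it helps, here is the same idea as a self-contained sketch you can run from the command line. It is not your controller: collectPeople() and the array keys are names I made up, the XML is trimmed from your sample, and the database insert is replaced by collecting everything into an array. One extra thing I handle here is a self-closing <person />, which as far as I know never produces an END_ELEMENT.

```php
<?php
// Trimmed sample from the question, kept inline so the sketch runs on its own.
$xml = <<<'XML'
<people>
<person id='400001' lastname='Abercrombie' firstname='Neil'>
<role type='rep' startdate='1985-01-03' enddate='1986-10-18' />
<role type='rep' startdate='1991-01-03' enddate='1992-10-09' />
</person>
</people>
XML;

// Hypothetical helper: returns one array per <person>, each with its roles.
function collectPeople(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $people  = [];
    $current = null; // the <person> currently being read, if any

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if ($reader->name === 'person') {
                $current = [
                    'id'       => $reader->getAttribute('id'),
                    'lastname' => $reader->getAttribute('lastname'),
                    'roles'    => [],
                ];
                if ($reader->isEmptyElement) {
                    // <person ... /> has no closing tag, so store it right away.
                    $people[] = $current;
                    $current  = null;
                }
            } elseif ($reader->name === 'role' && $current !== null) {
                // Each <role> belongs to the person that is currently open.
                $current['roles'][] = [
                    'startdate' => $reader->getAttribute('startdate'),
                    'enddate'   => $reader->getAttribute('enddate'),
                ];
            }
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT
            && $reader->name === 'person'
            && $current !== null
        ) {
            // Closing </person>: all roles are collected, this is where a db insert would go.
            $people[] = $current;
            $current  = null;
        }
    }

    $reader->close();
    return $people;
}

$people = collectPeople($xml);
foreach ($people as $person) {
    printf("%s: %d roles\n", $person['id'], count($person['roles']));
}
```

Collecting the roles on the person and only saving at the closing tag keeps each record complete before it is written, which is the reason for watching END_ELEMENT in the first place.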
By the way, if you want this to work, the curly quotes ‘ in your code have to be replaced with proper straight quotes '.