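We use Acalog at our institution and want to use their (unsupported) API to pull catalog content from their site into ours. I can access their files and extract information, but the formatting (paragraphs, bold, italics, line breaks) is expressed as nodes (h:p, h:b, h:i, h:br). Unfortunately, the text I extract from the content contains only the plain text and none of the formatting nodes. How do I bring the nodes in along with the text? Where am I going wrong?
The beginning of the XML (I cut it off about halfway through the markup):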
<catalog xmlns="http://acalog.com/catalog/1.0" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:a="http://www.w3.org/2005/Atom" xmlns:xi="http://www.w3.org/2001/XInclude" id="acalog-catalog-6">
<hierarchy>
<legend>
<key id="acalog-entity-type-5">
<name>Department</name>
<localname>Department</localname>
</key>
</legend>
<entity id="acalog-entity-239">
<type xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" xi:xpointer="xmlns(c=http://acalog.com/catalog/1.0) xpointer((//c:key[@id='acalog-entity-type-5'])[1])"/>
</type>
<a:title xmlns:a="http://www.w3.org/2005/Atom">American Studies</a:title>
<code/>
<a:content xmlns:a="http://www.w3.org/2005/Atom" xmlns:h="http://www.w3.org/1999/xhtml">
<h:p xmlns:h="http://www.w3.org/1999/xhtml">
<h:span class="dept_intro">
<h:i>Chair of the Department of American Studies: </h:i>
</h:span>
<h:span class="dept_intro">John Smith</h:span>
<h:br/>
<h:span class="dept_intro">
<h:br/>
Professors: Jane Smith; Sarah Smith, <h:i class="dept_intro">The Douglas Family Chair in American Culture, History, and Literary and Interdisciplinary Studies</h:i>
<h:br/><h:br/>
Associate Professor: Michael Smith
</h:span>
<h:span class="dept_intro"><h:br/></h:span>
</h:p>
<h:p xmlns:h="http://www.w3.org/1999/xhtml">
<h:span class="dept_intro">Assistant Professor: Rebecca Smith</h:span>
</h:p>
<h:p xmlns:h="http://www.w3.org/1999/xhtml">
<h:span class="dept_intro">Lecturer: * Leonard Smith</h:span></h:p>
<h:p xmlns:h="http://www.w3.org/1999/xhtml">
<h:span class="dept_intro">Visiting Lecturer: * Robert Smith<h:br/><h:br/><h:br/><h:br/></h:span><h:strong>Department Overview</h:strong></h:p>
<h:p xmlns:h="http://www.w3.org/1999/xhtml" class="MsoNormal">American studies is an interdiscipl
Here is the code I have written so far:
$xml = file_get_contents($url);
if ($xml === false) {
    return false;
} else {
    // Create an empty DOMDocument object to hold our service response
    $dom = new DOMDocument('1.0', 'UTF-8');
    // Load the XML
    $dom->loadXML($xml);
    // Create an XPath object
    $xpath = new DOMXPath($dom);
    // Register the namespaces used in the XPath queries
    // ('c' is the catalog namespace needed by the status check below)
    $xpath->registerNamespace('c', 'http://acalog.com/catalog/1.0');
    $xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');
    $xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
    $xpath->registerNamespace('xi', 'http://www.w3.org/2001/XInclude');
    // Check for error
    $status_elements = $xpath->query('//c:status[text() != "success"]');
    if ($status_elements->length > 0) {
        // An error occurred
        return false;
    }
    $x = $dom->documentElement;
    foreach ($x->childNodes as $item) {
        //echo $item->nodeName . " = " . $item->nodeValue . "<br/><br />";
    }
    // Retrieve all a:content elements
    $pageText = $xpath->query('//a:content');
    if ($pageText->length == 0) {
        // No text found
        return false;
    }
    foreach ($pageText as $item) {
        $txt = (string) $item->nodeValue;
        $txt = str_replace('<h:i>', '<i>', $txt);
        $txt = str_replace('</h:i>', '</i>', $txt);
        $txt = str_replace('<h:span class="dept_intro">', '<p>', $txt);
        $txt = str_replace('</h:span>', '</p>', $txt);
        // strpos() returns 0 for a match at the start, so compare against false
        if (strpos($txt, 'Department Overview') !== false) {
            echo '<p>' . str_replace('Department Overview', '', $txt) . '</p>';
            break;
        } else {
            echo '<p>' . $txt . '</p>';
        }
        //echo $pageText->nodeValue;
    }
}
The line $pageText = $xpath->query('//a:content'); pulls in the content, but not the tags.
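
For context on why the tags disappear: nodeValue on an element returns only the concatenated text of its descendants, so the str_replace() calls never see any `<h:i>` or `<h:span>` strings to replace. One possible way around this, offered as a minimal sketch rather than a confirmed Acalog recipe, is to walk the child nodes and rebuild the markup from each element's localName. The helper name renderXhtml() and the shortened sample XML below are hypothetical and only there for illustration.

```php
<?php
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: rebuilds the markup below $node as plain HTML,
// dropping the "h:" prefix from elements in the XHTML namespace.
function renderXhtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Text (and CDATA) nodes are escaped and copied through.
            $html .= htmlspecialchars($child->nodeValue, ENT_QUOTES, 'UTF-8');
        } elseif ($child instanceof DOMElement) {
            $inner = renderXhtml($child);
            if ($child->namespaceURI !== XHTML_NS) {
                // Not an XHTML element: keep only its contents.
                $html .= $inner;
                continue;
            }
            $tag = $child->localName; // "i" for <h:i>, "br" for <h:br/>
            $attrs = '';
            foreach ($child->attributes as $attr) {
                if (strpos($attr->nodeName, 'xmlns') === 0) {
                    continue; // skip namespace declarations, if any are exposed
                }
                $attrs .= ' ' . $attr->nodeName . '="'
                    . htmlspecialchars($attr->nodeValue, ENT_QUOTES, 'UTF-8') . '"';
            }
            // <br> is a void element in HTML, so it gets no closing tag.
            $html .= ($tag === 'br' && $inner === '')
                ? "<br$attrs>"
                : "<$tag$attrs>$inner</$tag>";
        }
    }
    return $html;
}

// Shortened stand-in for the catalog XML (an assumption, not the real feed).
$sample = '<catalog xmlns="http://acalog.com/catalog/1.0"'
    . ' xmlns:h="http://www.w3.org/1999/xhtml"'
    . ' xmlns:a="http://www.w3.org/2005/Atom">'
    . '<a:content><h:p><h:span class="dept_intro"><h:i>Chair: </h:i></h:span>'
    . 'John Smith<h:br/></h:p></a:content></catalog>';

$dom = new DOMDocument();
$dom->loadXML($sample);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

foreach ($xpath->query('//a:content') as $item) {
    echo renderXhtml($item), "\n";
}
```

If the exact serialization matters less, $dom->saveXML($child) on each child node is another option; it returns the markup with the h: prefixes still in place, which would then need to be stripped separately.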