Trying to extract markup content from XML using PHP

Asked: 2013-01-22 16:01:18

Tags: php xml nodes

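We use Acalog at our institution and would like to use their (unsupported) API to pull catalog content from their site into ours. I can access their files and extract the information, but the formatting (paragraphs, bold, italics, line breaks) is expressed as nodes (h:p, h:b, h:i, h:br). Unfortunately, the text I get back from the content I query contains only the plain text and none of the formatting nodes. How can I bring the nodes in along with the text? Where am I going wrong?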

The beginning of the XML (I cut it off about halfway through the markup):

<catalog xmlns="http://acalog.com/catalog/1.0" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:a="http://www.w3.org/2005/Atom" xmlns:xi="http://www.w3.org/2001/XInclude" id="acalog-catalog-6">
<hierarchy>
    <legend>
        <key id="acalog-entity-type-5">
            <name>Department</name>
            <localname>Department</localname>
        </key>
    </legend>
    <entity id="acalog-entity-239">
        <type xmlns:xi="http://www.w3.org/2001/XInclude">
            <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" xi:xpointer="xmlns(c=http://acalog.com/catalog/1.0) xpointer((//c:key[@id='acalog-entity-type-5'])[1])"/>
        </type>
        <a:title xmlns:a="http://www.w3.org/2005/Atom">American Studies</a:title>
        <code/>
        <a:content xmlns:a="http://www.w3.org/2005/Atom" xmlns:h="http://www.w3.org/1999/xhtml">
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">
                    <h:i>Chair of the Department of American Studies: </h:i>
                </h:span>
                <h:span class="dept_intro">John Smith</h:span>
                <h:br/>
                <h:span class="dept_intro"> 
                    <h:br/>&#xD;
                    Professors: Jane Smith; Sarah Smith, <h:i class="dept_intro">The Douglas Family Chair in American Culture, History, and Literary and Interdisciplinary Studies</h:i>
                    <h:br/><h:br/>&#xD;Associate Professor: Michael Smith
                </h:span>
                <h:span class="dept_intro"><h:br/></h:span>
            </h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">Assistant Professor: Rebecca Smith</h:span>
            </h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">Lecturer: * Leonard Smith</h:span></h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">Visiting Lecturer: * Robert Smith<h:br/><h:br/><h:br/><h:br/></h:span><h:strong>Department Overview</h:strong></h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml" class="MsoNormal">American studies is an  interdiscipl

Here is the code I have written so far:

$xml = file_get_contents($url);
    if ($xml === false) {
        return false;
    } else {
        // Create an empty DOMDocument object to hold our service response
        $dom = new DOMDocument('1.0', 'UTF-8');
        // Load the XML
        $dom->loadXML($xml);
        // Create an XPath Object
        $xpath = new DOMXPath($dom);
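        // Register the namespaces used in the XPath queries below
        // ('c' must be registered too, since '//c:status' uses that prefix)
        $xpath->registerNamespace('c', 'http://acalog.com/catalog/1.0');
        $xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');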
        $xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
        $xpath->registerNamespace('xi', 'http://www.w3.org/2001/XInclude');
        // Check for error
        $status_elements = $xpath->query('//c:status[text() != "success"]');
        if ($status_elements->length > 0) {
            // An error occurred
            return false;
        }
        $x = $dom->documentElement;
        foreach ($x->childNodes AS $item)
          {
          //echo $item->nodeName . " = " . $item->nodeValue . "<br/><br />";
          }
        // Retrieve all a:content elements
        $pageText = $xpath->query('//a:content');
        if ($pageText->length == 0) {
            // No text found
            return false;
        }

        foreach ($pageText AS $item) {
            $txt = (string) $item->nodeValue;
            $txt = str_replace('<h:i>','<i>',$txt);
            $txt = str_replace('</h:i>','</i>',$txt);
            $txt = str_replace('<h:span class="dept_intro">','<p>',$txt);
            $txt = str_replace('</h:span>','</p>',$txt);
            if (strpos($txt, 'Department Overview') !== false) {
                echo '<p>' . str_replace('Department Overview','',$txt) . '</p>';
                break;  
            } else {
                echo '<p>' . $txt . '</p>';
            }
            //echo $pageText->nodeValue;
        }
    }

The line $pageText = $xpath->query('//a:content'); pulls in the content, but not the tags.
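
For reference, the behaviour described above matches how nodeValue works in PHP's DOM extension: it returns the concatenated text of a node and its descendants, so the str_replace calls never see any tags. Below is a minimal sketch of one possible way to keep the markup, by serializing each child node with DOMDocument::saveXML(). The sample XML is a shortened, hypothetical stand-in for the feed, innerXml() is a made-up helper name, and the final prefix stripping assumes that removing "h:" is enough to get usable HTML for this data.

```php
<?php
// Hypothetical, shortened sample modeled on the feed excerpt in the question.
$xml = <<<'XML'
<catalog xmlns="http://acalog.com/catalog/1.0"
         xmlns:h="http://www.w3.org/1999/xhtml"
         xmlns:a="http://www.w3.org/2005/Atom">
  <a:content><h:p><h:i>Chair: </h:i><h:span class="dept_intro">John Smith</h:span><h:br/></h:p></a:content>
</catalog>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

/**
 * Serialize the children of a node, keeping element markup.
 * innerXml() is a hypothetical helper, not part of the DOM extension.
 */
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) dumps that single node, tags included
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$content = $xpath->query('//a:content')->item(0);

$textOnly = $content->nodeValue;  // text of all descendants, no tags
$markup   = innerXml($content);   // still contains <h:p>, <h:i>, ...

// Assumption: dropping the "h:" prefix (and any xmlns:h declaration
// the serializer may emit) is enough to get plain HTML tags here.
$html = preg_replace('/\sxmlns:h="[^"]*"/', '', $markup);
$html = preg_replace('#<(/?)h:#', '<$1', $html);

echo $html, "\n";
```

In the original loop, this would mean building $txt from the serialized children of $item instead of from $item->nodeValue. A regex-based prefix swap is only a rough approach; walking the nodes and rebuilding the elements would be more robust if the content is less regular than this sample.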

0 answers:

There are no answers yet.