Question

I hope you can help me. The XML file looks like this:

<channel><item>
<description>
<div>  <a href="http://image.com">
<span>   
<img src="http://image.com" /> 
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc... 
</div>
</description>
</item></channel>

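I can get the contents of the description tag, but when I do, I get the whole structure, which has a lot of CSS in it, and I don't want that. What I really need is to parse out the href link and the Lorem Ipsum text. I am trying to use SimpleXML, but I can't figure it out; it seems too complicated. Any ideas?

Edit: the code I use to parse the XML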

$file = new SimpleXMLElement($mydata);

foreach ($file->channel->item as $post) {
    echo $post->description;
}

Answer 1

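That XML looks very much like an RSS or Atom feed (or an extract from one). The description node would usually be escaped, or placed inside a section marked <![CDATA[ ... ]]>, which indicates that its contents are to be treated as raw text, even if they contain <, >, or &.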

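Your sample doesn't indicate that, but if your echo is giving you the whole content, including the img tags and so on, then that is what is happening, and your problem is similar to one that has been asked before: you need to take the whole description content and parse it as a document in its own right.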
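A quick way to tell the two cases apart, as a sketch that assumes the feed wraps the HTML in CDATA (the `<rss>` root and the shortened HTML are made up for illustration):

```php
<?php
// Hypothetical feed: the description holds HTML inside a CDATA section.
$mydata = '<rss><channel><item><description><![CDATA['
        . '<div><a href="http://image.com">link</a> Lorem Ipsum</div>'
        . ']]></description></item></channel></rss>';

$file = new SimpleXMLElement($mydata);
$description = $file->channel->item->description;

// With CDATA (or escaping), the description has no child elements...
$childCount = $description->count();

// ...and casting it to a string yields the raw HTML as plain text.
$rawHtml = (string) $description;

echo $childCount, "\n", $rawHtml, "\n";
```

If the child count is zero and the string cast contains markup, the HTML has to be parsed as a separate document before the link or text can be reached.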

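If for some reason the HTML is not escaped, and is actually included as a set of child nodes in the XML, then the URL of the link can be accessed directly (assuming the structure is always consistent):

echo (string)$post->description->div->a['href'];

As for the text, when you cast an element to a string with (string), SimpleXML concatenates all the text content of that particular element (but not that of its children). echo performs this cast automatically, but I'm guessing you will eventually want to do something other than echo.

In your example, the text you want is in the first (and only) div, so this would display it:

echo (string)$post->description->div;

However, you mention "a lot of CSS", which I'm guessing you left out of your example for simplicity, so I'm not sure how consistent your real content is.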
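As a minimal sketch of the direct-access case, assuming the HTML really is embedded as child nodes: the data below is the sample from the question, wrapped in an assumed `<rss>` root so that `$file->channel->item` resolves.

```php
<?php
// Sample from the question, wrapped in an assumed <rss> root element.
$mydata = <<<XML
<rss><channel><item>
<description>
<div>  <a href="http://image.com">
<span>
<img src="http://image.com" />
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc...
</div>
</description>
</item></channel></rss>
XML;

$file = new SimpleXMLElement($mydata);

foreach ($file->channel->item as $post) {
    $div = $post->description->div;

    // Attribute access returns a SimpleXMLElement, so cast it to a string.
    $href = (string) $div->a['href'];

    // Casting the div yields only its own text nodes, not the text of <a> or <span>.
    $text = trim((string) $div);

    echo $href, "\n", $text, "\n";
}
```

The `trim()` call is there because the div's own text includes the whitespace and line breaks around the nested `<a>` element.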

Answer 2

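This is going to get complicated. ~~You don't have XML, but HTML. One difference is that in XML a tag can't contain both another tag and some text. That's why~~ I use PHP's DOM (I haven't used it before, but it is similar to plain JavaScript).

This is what I hacked together (untested):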

// first create our document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML("your html here"); // there is also a loadHTMLFile

// this tries to get an a element which has a href and returns that href
function getAHref ( $doc ) {
    // now get all a elements to get the one with a href
    $aElements = $doc->getElementsByTagName( "a" );
    foreach ( $aElements as $a ) {
        // does this element have an href? then return it
        if ( $a->hasAttribute( "href" ) ) {
            return $a->getAttribute( "href" );
        }
    }
    // failed? return false
    return false;
}

// tries to get the text in the node
// in your example the text isn't wrapped in anything, so this is going to be difficult
function getTextFromNode ( $doc ) {
    // get and loop all divs (assuming the text is always a child of a div)
    $divs = $doc->getElementsByTagName( "div" ); // do we know it's always in that div?
    foreach ( $divs as $div ) {
        // also loop all child nodes to get the text nodes
        foreach ( $div->childNodes as $child ) {
            // is this a text node?
            if ( $child->nodeType == XML_TEXT_NODE ) {
                // is there something in it (new lines count as text nodes)
                if ( trim( $child->nodeValue ) != "" ) {
                    // *pfew* got it
                    return $child->nodeValue;
                }
            }
        }
    }
    // failed? return false
    return false;
}
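As an alternative sketch, the same two lookups can be written with `DOMXPath` instead of manual loops. The HTML string is the sample from the question, and the variable names are illustrative only.

```php
<?php
// Sample HTML from the question's description element.
$html = '<div>  <a href="http://image.com"><span><img src="http://image.com" /></span></a>
Lorem Ipsum is simply dummy text of the printing etc...
</div>';

$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// string() on a node set returns the value of its first node, or '' if it is empty.
$href = $xpath->evaluate('string(//a/@href)');

// First direct text child of a div that is not just whitespace.
$text = trim($xpath->evaluate('string(//div/text()[normalize-space()][1])'));

echo $href, "\n", $text, "\n";
```

The `normalize-space()` predicate plays the same role as the `trim( $child->nodeValue ) != ""` check: it skips text nodes that contain only whitespace or line breaks.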

Answer 3

This is the final code that answers the question.

$xml = simplexml_load_file('myfile.xml');

$descriptions = $xml->xpath('//item/description');

foreach ( $descriptions as $description_node ) {

    $description_dom = new DOMDocument();
    $description_dom->loadHTML( (string)$description_node );

    $description_sxml = simplexml_import_dom( $description_dom );

    $imgs = $description_sxml->xpath('//img');
    $text = $description_sxml->xpath('//div');

    foreach ($imgs as $image) {
        echo (string)$image['src'];
    }

    foreach ($text as $t) {
        echo (string)$t;
    }
}

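This is IMSoP's code; I added $text = $description_sxml->xpath('//div'); to read the text inside the <div>.

In my case, some posts in the XML have multiple <div> and <span> tags, so to parse all of them I may need to add another ->xpath for <span>, or perhaps an if... else statement, so that if there is nothing inside the <div>, the <span> content is echoed instead. Thanks for your replies.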
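The fallback described above might look something like the following sketch. `extractText` is a hypothetical helper name, and the two HTML strings are invented test inputs rather than real feed content.

```php
<?php
// Hypothetical helper: prefer text directly inside a <div>, fall back to <span>.
function extractText(string $html): string
{
    $dom = new DOMDocument();
    // @ suppresses parser warnings that imperfect feed HTML may trigger.
    @$dom->loadHTML($html);
    $sxml = simplexml_import_dom($dom);

    // Try each query in order and return the first non-empty text found.
    foreach (['//div', '//span'] as $query) {
        foreach ($sxml->xpath($query) as $node) {
            // The cast returns only the element's own text, not its children's.
            $text = trim((string) $node);
            if ($text !== '') {
                return $text;
            }
        }
    }

    return '';
}

$fromSpan = extractText('<div><span>Only in span</span></div>');
$fromDiv  = extractText('<div>In div <span>and span</span></div>');

echo $fromSpan, "\n", $fromDiv, "\n";
```

Listing the queries in an array keeps the order of preference explicit and makes it easy to add further fallbacks later.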

How to parse text and images from complex XML
