Preserve line breaks inside <p> tags using DOMXPath?

Asked: 2011-01-19 19:44:44

Tags: php html dom xpath

I am currently using PHP and DOMXPath to get the contents of all <p> elements in a web page:

<?php
// ...
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}

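My problem is that the string produced by textContent does not respect the <br /> tags inside the <p> elements. Instead, it drops the line break and pushes together words that would normally be on separate lines. For example:

Example HTML: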

<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>

<p>
Random information and what not<br />
Isn't that cool?
</p>

Current output of the PHP above:

Some happy talk about our great product.We would love for you to buy it!

Random information and what notIsn't that cool?

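I also tried $paragraphs = $doc->getElementsByTagName("p"); and it gave me the same result.

Is there a way to make DOMXPath / DOMDocument preserve the line breaks? I need to be able to separate every word in a paragraph, and the current output does not allow that.

If there is another way to retrieve the strings inside <p> elements while keeping each <br /> as '\n', that would be great too.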

Edit


After further investigation, the HTML in question is actually a list of anchors separated by <br> tags, with no actual newline characters:

<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

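As it turns out, my code does work with the example HTML originally given, because that markup contains real newline characters after each <br />.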

Update: Solved


With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this is now solved. It is not very elegant, but it works for now.

Example:

foreach ($paragraphs as $paragraph){
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    //Do whatever, in this case get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}

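This uses DOMinnerHTML (suggested by @netcoder) to get each paragraph's markup, replaces every instance of <br> with "\r\n" (suggested by @ircmaxell), and then reads textContent from the result.

There is obviously still room for improvement, but it solves my current problem.

Thanks for the help, everyone.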
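The snippet above calls DOMinnerHTML() without showing its body. As a rough sketch only, a helper of that name might look like the following; this implementation is an assumption, not necessarily the one @netcoder had in mind, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper: the post uses DOMinnerHTML() but never defines it.
// This version serializes every child of $element and concatenates the results.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes only that node (available since PHP 5.3.6).
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Illustrative input, shortened from the anchor list in the question.
$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br> <a href="/b.html">Savings Plans</a></p>');
$p = $doc->getElementsByTagName('p')->item(0);

$inner = DOMinnerHTML($p);
echo $inner, "\n";
```

Because the <br> tags survive in the returned string, str_ireplace() can then swap them for "\r\n" before the string is parsed again.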

3 Answers:

Answer 0 (score: 4)

Well, what I would do is replace the <br> elements with literal line breaks:

$doc = new DOMDocument();
$doc->loadHTML($html);

$brs = $doc->getElementsByTagName('br');
// getElementsByTagName() returns a live list, so copy it before replacing nodes;
// otherwise some <br> elements are skipped as the list shrinks.
foreach (iterator_to_array($brs) as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
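To tie this back to the goal of separating every word, here is a minimal sketch that applies the same replacement and then splits on whitespace. The shortened anchor list and the use of preg_split() are assumptions for illustration, not part of the answer.

```php
<?php
// Illustrative input: anchors separated by <br> with no real newlines.
$html = '<p><a href="/a.html">Classic Checking</a><br><a href="/b.html">Interest Checking</a><br><a href="/c.html">CDs</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath results are a snapshot, so replacing nodes while looping is safe.
foreach ($xpath->query('//br') as $br) {
    $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
}

$words = [];
foreach ($xpath->query('/html/body//p') as $paragraph) {
    // Split on any run of whitespace and drop empty entries.
    $words = array_merge(
        $words,
        preg_split('/\s+/', $paragraph->textContent, -1, PREG_SPLIT_NO_EMPTY)
    );
}

print_r($words);
```

Splitting on a whitespace pattern rather than a single space means the inserted newline separates "Checking" from "Interest" without any extra string replacement.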

Answer 1 (score: 2)

One possibility:

echo simplexml_import_dom($paragraph)->asXML();
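Note that asXML() returns the paragraph's markup, <br> tags included, rather than plain text. As a hedged sketch of one way to go from there to text with newlines (the regular expression and the strip_tags() step are assumptions, not part of this answer):

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML("<p>Random information and what not<br>Isn't that cool?</p>");
$paragraph = $doc->getElementsByTagName('p')->item(0);

// Serialize the paragraph, tags and all, via SimpleXML.
$xml = simplexml_import_dom($paragraph)->asXML();

// Turn every <br> variant into a newline, then drop the remaining tags.
$text = strip_tags(preg_replace('~<br\s*/?>~i', "\n", $xml));

echo $text, "\n";
```

Keep in mind that the serialized string still has entities such as &amp; encoded, so html_entity_decode() may be needed for real pages.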

Answer 2 (score: 1)

I had the same situation, and I use:

$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

I use urlencode() to convert the tag so that it can be displayed or inserted into a database.
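The idea is that the encoded tag is plain text to the parser, so it survives textContent and can be used as a split marker afterwards. A minimal sketch, assuming the sample paragraph from the question and a hypothetical $marker variable:

```php
<?php
$html = "<p>Random information and what not<br>Isn't that cool?</p>";

// urlencode('<br>') yields "%3Cbr%3E", which the HTML parser treats as text.
$marker = urlencode('<br>');

$doc = new DOMDocument();
$doc->loadHTML(str_replace('<br>', $marker, $html));

$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// The marker is still present in the text, so split on it.
$lines = explode($marker, $text);

print_r($lines);
```

This only catches the literal string <br>; variants such as <br /> or <BR> would need their own replacements or a case-insensitive pattern.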