Unlink images, remove unclosed <p> tags and strip all styles

Time: 2014-09-30 15:59:55

Tags: php html html-parsing domdocument

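My WordPress posts have some problems, and I'm trying to fix them with DOMDocument.

The first problem is that my images (<img>) sit inside <a> tags, and I want to remove the <a> tags.

I also want to remove all unclosed <p> tags (those with no </p>), and I want to strip the style attribute from every element.

I could post some of the code I've tried, but I don't think it would be useful at all, because I'm getting nowhere. For now I'm only trying to remove the links from the images, but nothing seems to work. I really don't understand how to work with DOMDocument child elements.

Here is a sample of the HTML that needs fixing: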

<img width="750" height="500" src="http://fancycribs.com/wp-content/uploads/2013/05/Modern-Riverside-Apartment-–-A-Stylish-and-Elegant-Residence-6.jpg" class="attachment-large wp-post-image" alt="Modern Riverside Apartment – A Stylish and Elegant Residence (6)" />        <p>This modern seventh floor riverside apartment is placed in the luxurious and modern Montevetro Building, which is close to Battersea Square with access to Chelsea, Fulham and Kings Road by crossing Battersea Bridge, London. This residence has become one of the iconic buildings in the Battersea area.</p>
<p>It offers spectacular views over the serene tranquility of the river. This apartment offers comfort and luxury throughout its double reception room, three bedrooms, three bathrooms and large decked balcony. The design details are astonishing: mahogany wood floors, original hand painted walls, large floor to ceiling windows offering a spectacular view over the river. The apartment is spacious, the space between living room and dining room is fluid, having continuity. The hall is large and has a lot of storage spaces, having the quality to link rooms one to another. The kitchen space is large and has plenty of storage capacity. It is dressed up in mahogany wood, offering personality and contrast and access to the large balcony.</p>
<p>The master bedroom is a masterpiece of style and elegance, with nice and simple furniture, a bathroom and accompanied by two further double bedrooms, a family bathroom and a shower room. The residence overwhelms you through its luxury and the splendid view.</p>
<p style="text-align: center"><a href="http://fancycribs.com/37216-modern-riverside-apartment-a-stylish-and-elegant-residence.html/modern-riverside-apartment-a-stylish-and-elegant-residence-7" rel="attachment wp-att-39033" class="local-link"><img class="aligncenter size-medium wp-image-39033" alt="Modern Riverside Apartment – A Stylish and Elegant Residence" src="http://fancycribs.com/wp-content/uploads/2013/05/Modern-Riverside-Apartment-–-A-Stylish-and-Elegant-Residence-7-670x446.jpg" width="670" height="446" title="Modern Riverside Apartment – A Stylish and Elegant Residence" /></a></p>
<p style="text-align: center">

Later edit:

This is what I've tried. It seems to unlink the images, but only images 1, 3, 5 and 7, while images 2, 4 and 6 stay unchanged.

$html = new DOMDocument;
$html->preserveWhiteSpace = false;
$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$content);
foreach($html->getElementsByTagName('a') as $a) {
    if($a->hasChildNodes()) {
        $img = $a->getElementsByTagName('img')->item(0);
        $a->parentNode->replaceChild($img,$a);
    }
}
$text = $html->saveHTML();
echo $text;

Thanks

2 Answers:

Answer 0 (score: 0)

I've managed to do it using DOMDocument and HTML Purifier.

Here is the code:

require_once 'library/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.TidyLevel','heavy');
$config->set('AutoFormat.RemoveEmpty', true);
$config->set('AutoFormat.RemoveEmpty.RemoveNbsp', true);
$purifier = new HTMLPurifier($config);

$clean_html = $purifier->purify($content);
$html = new DOMDocument;
$html->preserveWhiteSpace = false;
$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$clean_html);
// getElementsByTagName() returns a live list, so walk it backwards:
// replacing a node then never shifts the items still to be visited.
$links = $html->getElementsByTagName('a');
for ($i = $links->length - 1; $i >= 0; --$i) {
    $a = $links->item($i);
    $img = $a->getElementsByTagName('img')->item(0);
    if ($img !== null) {
        // Swap the whole <a> for the first <img> it contains.
        $a->parentNode->replaceChild($img, $a);
    }
}

foreach($html->getElementsByTagName('p') as $p) {
    $p->removeAttribute('style');
}
$text = $html->saveHTML();
echo $text;
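
For comparison, here is a minimal sketch that tries to cover all three requests from the question with DOMDocument alone. It is not the code from this answer: the function name cleanPostHtml is hypothetical, and it assumes that an unclosed <p> ends up as an empty paragraph once libxml has parsed the markup. getElementsByTagName() returns a live DOMNodeList, which is why a forward foreach that replaces nodes only reaches every other link; copying the list with iterator_to_array() first sidesteps that.

```php
<?php
// Hypothetical helper, shown only as an illustration.
function cleanPostHtml(string $content): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into a plain array so that replacing nodes
    // cannot shift the items that are still waiting to be visited.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        $img = $a->getElementsByTagName('img')->item(0);
        if ($img !== null) {
            // Replace the whole <a> with the first <img> inside it.
            $a->parentNode->replaceChild($img, $a);
        }
    }

    $xpath = new DOMXPath($doc);

    // Remove the style attribute from every element that has one.
    foreach ($xpath->query('//*[@style]') as $element) {
        $element->removeAttribute('style');
    }

    // Assumption: the parser closes a dangling <p> itself, leaving an empty
    // paragraph. Drop <p> elements with no child elements and no visible text.
    $emptyParagraphs = $xpath->query('//p[not(*) and normalize-space(.) = ""]');
    foreach (iterator_to_array($emptyParagraphs) as $p) {
        $p->parentNode->removeChild($p);
    }

    return $doc->saveHTML();
}
```

Calling `echo cleanPostHtml($content);` would print the cleaned document. Links that do not wrap an image are left alone, and paragraphs that contain only an image are kept because they have a child element.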

Answer 1 (score: -1)

You can try running this code and see whether you're happy with the result. It finds <a ...><img ... and replaces it with just <img ...

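$p = "/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*<img.*)<\/a>/siU";
// preg_replace()'s fourth argument is a limit, not flags: passing
// PREG_SET_ORDER (value 2) there would stop after two replacements.
$newHtml = preg_replace($p, '$3', $html);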
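
One detail worth keeping in mind with this approach: the fourth parameter of preg_replace() is a replacement limit, so an integer constant such as PREG_SET_ORDER (which equals 2) passed in that position caps the number of replacements. The sketch below illustrates this with made-up markup and a simplified pattern; both are assumptions for the demo, not the pattern from this answer.

```php
<?php
// Made-up sample: three linked images.
$sample = '<a href="/1"><img src="1.jpg"></a>'
        . '<a href="/2"><img src="2.jpg"></a>'
        . '<a href="/3"><img src="3.jpg"></a>';

// Simplified pattern for the demo: an <a> that wraps exactly one <img>.
$pattern = '/<a\s[^>]*>\s*(<img[^>]*>)\s*<\/a>/i';

// The fourth argument is the limit, so this replaces at most two matches.
$limited = preg_replace($pattern, '$1', $sample, PREG_SET_ORDER);

// Without a limit, every match is replaced.
$all = preg_replace($pattern, '$1', $sample);

echo substr_count($limited, '<a '), "\n"; // prints 1: one link survives
echo substr_count($all, '<a '), "\n";     // prints 0: all links removed
```

Regular expressions remain fragile on real-world HTML (nested tags, unusual attribute quoting), so a DOM-based pass is usually the safer choice for markup that comes straight from post content.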