Keep only certain tags in an HTML string using PHP

Time: 2014-02-10 04:49:45

Tags: php simple-html-dom

I'm scraping a website with simple_html_dom and need a result that falls somewhere between ->innertext and ->plaintext.

For example, this is the source string:

<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>

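I need to remove the span tags but keep their content (unless a span contains only &nbsp;), while leaving <i>, <b> and <u> in place.

So the result I want to achieve here is the string:

[28] The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides: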

3 answers:

Answer 0: (score: 0)

You can try this.

// strip_tags() removes every tag except those named in the allow-list (second argument).
echo strip_tags('<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>', '<i><b><u>');

Answer 1: (score: 0)

You can try the following lines of code:

<?php

$string = '<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>';

// Remove attributes within the <span> tag, just for clarity's sake.
$string = preg_replace('/(<span ([^\>]+)>)/i', '<span>', $string);

// Remove any spans that only contain &nbsp;
$string = preg_replace('/<span>([ ]|&nbsp;)*<\/span>/i', '', $string);

// Replace any consecutive span (opening or closing) tags with a space, to make
// clear the separation between one span and the next.
$string = preg_replace('/<(\/)?span><(\/)?span>/i', ' ', $string);

// Remove any remaining opening or closing span tags.
$string = preg_replace('/<(\/)?span>/i', '', $string);

print $string;

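Note that I added an i after the closing slash of each regular expression, which makes the search case-insensitive. That covers the case where your markup contains <SPAN>, <span>, or even <SpaN>.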
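To illustrate the effect of the i modifier, here is a small sketch comparing the same pattern with and without it. The sample string and variable names are made up for the illustration.

```php
<?php
// Made-up sample input with three spellings of the span tag.
$html = '<SPAN>a</SPAN><span>b</span><SpaN>c</SpaN>';

// Without the i modifier only the lowercase tags are removed.
$sensitive = preg_replace('/<\/?span>/', '', $html);

// With the i modifier every spelling is removed.
$insensitive = preg_replace('/<\/?span>/i', '', $html);

echo $sensitive, "\n";   // <SPAN>a</SPAN>b<SpaN>c</SpaN>
echo $insensitive, "\n"; // abc
```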

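Sure, it isn't one tightly compressed regex one-liner. I wrote it this way so that you can see each step along the way: add a print $string; line after any step to check the progress. I hope that showing the code like this helps you, in the long run, get a better feel for how regular expressions and preg_replace are used.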
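If the step-by-step version is no longer needed, the same four steps could be collapsed into a single preg_replace call, since preg_replace accepts arrays of patterns and replacements and applies them in order. This is only a sketch: the helper name strip_spans is hypothetical, and the first two patterns are slightly loosened variants of the ones above (the second also accepts any whitespace, not just a plain space).

```php
<?php
// Hypothetical helper; the name is an assumption, not part of the answer above.
function strip_spans(string $html): string
{
    $patterns = [
        '/<span\b[^>]*>/i',                // 1. drop attributes from opening span tags
        '/<span>(?:\s|&nbsp;)*<\/span>/i', // 2. remove spans holding only whitespace or &nbsp;
        '/<\/?span><\/?span>/i',           // 3. two adjacent span tags become one space
        '/<\/?span>/i',                    // 4. remove any remaining span tags
    ];
    $replacements = ['<span>', '', ' ', ''];

    // Each pattern is applied in order to the result of the previous one.
    return preg_replace($patterns, $replacements, $html);
}

echo strip_spans('<span lang="EN-CA">[28]<span style="x">&nbsp;&nbsp; </span></span><span>Text <i>kept</i></span>');
// prints: [28] Text <i>kept</i>
```

The trade-off is that the intermediate results are no longer visible, which is exactly what the longer version was written to show.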

Answer 2: (score: 0)

This is what strip_tags is for:

echo strip_tags('<span>strip me</span> <i>leave me alone</i>', '<i>');
//=> strip me <i>leave me alone</i>
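Note that strip_tags only removes the tags themselves, so the run of &nbsp; entities inside the question's filler span would stay in the output. One possible way around that, sketched under the assumption that filler spans contain nothing but whitespace and &nbsp; entities, is to delete those spans first and then call strip_tags with the allow-list. The function name keep_inline_tags and its default allow-list are hypothetical.

```php
<?php
// Hypothetical helper; the name and the default allow-list are assumptions.
function keep_inline_tags(string $html, string $allowed = '<i><b><u>'): string
{
    // Drop spans (with any attributes) whose content is only whitespace or &nbsp;.
    $html = preg_replace('/<span\b[^>]*>(?:\s|&nbsp;)*<\/span>/i', '', $html);

    // Remove every remaining tag except the allowed ones; text content is kept.
    return strip_tags($html, $allowed);
}

echo keep_inline_tags('<span lang="EN-CA">[28]<span style="font:7.0pt">&nbsp;&nbsp; </span></span><span>Section 19 of the <i>Trade-marks Act</i> provides:</span>');
// prints: [28]Section 19 of the <i>Trade-marks Act</i> provides:
```

Unlike the regex-based answer, this sketch does not insert a space where two spans meet, so [28] runs straight into the following text.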