Trying to remove HTML tags (+ content) from a string

Asked: 2014-01-27 11:57:54

Tags: php regex

OK, so basically I'm about to bang my head against the wall over this one.

Here's the code:

<?php

$s = "385,178<ref name=\"land area\">Data is accessible by following \"Create tables and diagrams\" link on the following site, and then using table 09280 \"Area of land and fresh water (km²) (M)\" for \"The whole country\" in year 2013 and summing up entries \"Land area\" and \"Freshwater\": {{cite web |url=http://www.ssb.no/en/natur-og-miljo/statistikker/arealdekke |title=Area of land and fresh water, 1 January 2013 |publisher=[[Statistics Norway]] |date=28 May 2013 |accessdate=23 November 2013}}</ref>";

function removeHTMLTags($str) { 
    $r = '/(\\<br\\>)|(\\<br\/\\>)|(\\<(.+?)(\\s*[^\\<]+)?\\>(.+)?\\<\\\\\/\\1\\>)|(\\<ref\\sname=([^\\<]+?)\/\\>)/';

    echo "Preg_matching : $str\n\n";
    echo "Regex : $r\n\n";

    return preg_replace($r,'',$str); 
}

echo removeHTMLTags($s);

?>

What I'm trying to do is basically get rid of the <ref name="... </ref> part (and any other tags that might appear).

However, this is what I get (i.e. the exact same string, with nothing replaced):

Preg_matching : 385,178<ref name="land area">Data is accessible by following "Create tables and diagrams" link on the following site, and then using table 09280 "Area of land and fresh water (km²) (M)" for "The whole country" in year 2013 and summing up entries "Land area" and "Freshwater": {{cite web |url=http://www.ssb.no/en/natur-og-miljo/statistikker/arealdekke |title=Area of land and fresh water, 1 January 2013 |publisher=[[Statistics Norway]] |date=28 May 2013 |accessdate=23 November 2013}}</ref>

Regex : /(\<br\>)|(\<br\/\>)|(\<(.+?)(\s*[^\<]+)?\>(.+)?\<\\\/\1\>)|(\<ref\sname=([^\<]+?)\/\>)/

385,178<ref name="land area">Data is accessible by following "Create tables and diagrams" link on the following site, and then using table 09280 "Area of land and fresh water (km²) (M)" for "The whole country" in year 2013 and summing up entries "Land area" and "Freshwater": {{cite web |url=http://www.ssb.no/en/natur-og-miljo/statistikker/arealdekke |title=Area of land and fresh water, 1 January 2013 |publisher=[[Statistics Norway]] |date=28 May 2013 |accessdate=23 November 2013}}</ref>

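So the question is: what am I doing wrong? (I've tested the regex several times with RegExr and it does seem to work there. Is it something to do with... the escaping?)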


P.S. For those who know what I'm talking about: yes, this is part of a Wikipedia infobox.

1 answer:

Answer 0 (score: 2)

You really should use the DOM for this kind of thing, because other solutions break easily:

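// Parse the fragment; libxml wraps the leading text in <html><body><p>.
$dom = new DOMDocument();
// Collect parser warnings (e.g. about the unknown <ref> tag) instead of printing them.
$errorState = libxml_use_internal_errors(true);
$dom->loadHTML($s);

// The first text node of the paragraph is the part before <ref>.
$xpath = new DOMXPath($dom);
$node = $xpath->query('//body/p/text()')->item(0);
echo $node->textContent;

// Restore the previous error-handling mode.
libxml_use_internal_errors($errorState);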
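
The snippet above only reads the text that precedes the first tag. If text can also follow a <ref> element, one possible variation is to remove the <ref> nodes and return whatever text remains. This is a minimal sketch, not part of the original answer: the function name stripRefElements is hypothetical, and it assumes that only <ref> elements should be dropped together with their content.

```php
<?php
// Hypothetical helper (an assumption for illustration): removes every <ref>
// element, including its content, and returns the remaining text.
function stripRefElements(string $html): string
{
    // DOMDocument::loadHTML() rejects an empty string on PHP 8, so bail out early.
    if ($html === '') {
        return '';
    }

    $dom = new DOMDocument();
    // Collect parser warnings (e.g. about the unknown <ref> tag) instead of printing them.
    $errorState = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($errorState);

    // Copy the matches into an array first, so that removing nodes
    // does not disturb the iteration.
    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query('//ref')) as $ref) {
        $ref->parentNode->removeChild($ref);
    }

    // Everything still inside <body> is text that sat outside the <ref> elements.
    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

echo stripRefElements('385,178<ref name="land area">Data is accessible ...</ref>'), "\n";
```

Calling stripRefElements($s) with the string from the question should leave just the number. Other tags are kept as plain text by textContent, so changing the XPath expression (for example to match further element names) is where additional tags would be handled.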