Using a regex to extract href descriptions based on a condition

Asked: 2012-09-21 14:22:20

Tags: php regex html-parsing preg-replace

  

Possible duplicate:
  How to parse and process HTML with PHP?

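I need to parse a block of HTML and replace some of the links with their link description, depending on whether the description matches certain criteria.

The regex I use to identify the specific strings is already used elsewhere in my application: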

$regex  = "/\b[FfGg][\.][\s][0-9]{1,4}\b/";
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

I used the following SO question as a starting point for extracting the href descriptions:

Replacing html link tags with a text description

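The idea is to convert any link whose description is an identifier of the form "FfGg.xxxx" and leave the remaining links alone (such as the Google link).

What I have so far is: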

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD 
show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in 
severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.
</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case 
reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a 
href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a 
href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" 
target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a 
href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a 
href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" 
target="F.96">F.96</a>);';

This converts all of the links, including the Google one:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>(.*?)<\/a>/i", "$2", $html);

This returns a blank HTML string:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>[FfGg][\.][\s][0-9]{1,4}<\/a>/i", "$2", $html);

I think the problem is in how I embed this regex in the second (non-working) example above:

[FfGg][\.][\s][0-9]{1,4}

What is the correct way to embed the FfGg expression in the HTML-matching pattern from the preg_replace examples above?
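As a quick check of the fragment in isolation, the sketch below tests it against the two spellings of an identifier. Two things stand out: `[\s]` requires a whitespace character between the dot and the digits, which none of the sample link texts contain, and the second preg_replace pattern has only one capture group, so `$2` refers to nothing. The variable names and the `$withoutSpace` variant are assumptions made for illustration. The null check at the end relies on the documented behaviour that preg_replace() returns null on error; whether that is what produced the blank string here has not been verified.

```php
<?php
// The identifier fragment from the question, anchored so it can be tested alone.
$withSpace = '/^[FfGg][\.][\s][0-9]{1,4}$/';

// Assumed variant: the same fragment without the whitespace requirement.
$withoutSpace = '/^[FfGg]\.[0-9]{1,4}$/';

// For each spelling, record [result with [\s], result without it].
$checks = [];
foreach (['F.100', 'F. 100'] as $text) {
    $checks[$text] = [
        preg_match($withSpace, $text),
        preg_match($withoutSpace, $text),
    ];
}
print_r($checks);

// preg_replace() returns null when the regex engine reports an error, so it
// is worth checking for that before using the result.
$sample = '<a href="/docpage.php?obscure==100" target="F.100">F.100</a>';
$result = preg_replace(
    '/<a.*?href="(.*?)".*?>[FfGg][\.][\s][0-9]{1,4}<\/a>/i',
    '$2',
    $sample
);

if ($result === null) {
    echo 'preg error code: ', preg_last_error(), PHP_EOL;
} else {
    echo $result, PHP_EOL;
}
```

On this short sample the pattern simply does not match, so the link comes back unchanged.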

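3 answers:

Answer 0: (score: 2)

You shouldn't parse HTML with regular expressions. You simply can't handle every case correctly. Here are some examples of valid HTML that would break your link-finding regex: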

<!-- <a href="www.blah.com">   -->    <a href="www.foo.com">F.100</a>
<area>...</area>  ...  <a href="www.foo.com">F.100</a>
<a href="www.foo.com">F.100</a >
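To see how each of these goes wrong, the following sketch runs the link-stripping pattern from the question over the three examples. The array keys and variable names are made up for illustration.

```php
<?php
// The link-stripping pattern from the question.
$pattern = '/<a.*?href="(.*?)".*?>(.*?)<\/a>/i';

$cases = [
    'comment' => '<!-- <a href="www.blah.com">   -->    <a href="www.foo.com">F.100</a>',
    'area'    => '<area>...</area>  ...  <a href="www.foo.com">F.100</a>',
    'end tag' => '<a href="www.foo.com">F.100</a >',
];

// Replace every match with its second capture group (the link text).
$results = [];
foreach ($cases as $name => $input) {
    $results[$name] = preg_replace($pattern, '$2', $input);
    echo $name, ': ', $results[$name], PHP_EOL;
}
```

In the first case the match starts inside the comment, so the real link loses its closing tag but keeps its opening one. In the second, `<a` matches the start of `<area>` and everything up to the link is swallowed. In the third, the space in `</a >` prevents any match, so the link is left untouched.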

I suggest looking at this question for a better approach: How do you parse and process HTML/XML in PHP?

Answer 1: (score: 2)

Here is the DOM (correct) approach:

Edit: improved regex

<?php

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" target="F.96">F.96</a>);';

    // Create a new DOMDocument and load the HTML string
    $dom = new DOMDocument('1.0');
    $dom->loadHTML($html);

    // Create an XPath object for this DOMDocument
    $xpath = new DOMXPath($dom);

    // Loop over all <a> elements in the document
    // Ideally we would combine the regex into the XPath query, but XPath 1.0
    // doesn't support it
    foreach ($xpath->query('//a') as $anchor) {
        // See if the link matches the pattern
        if (preg_match('/^\s*[gf]\s*\.\s*\d{1,4}\s*$/i', $anchor->nodeValue)) {
            // If it does, convert it to a text node (effectively, un-linkify it)
            $textNode = new DOMText($anchor->nodeValue);
            $anchor->parentNode->replaceChild($dom->importNode($textNode), $anchor);
        }
    }

    // Because you are working with partial HTML string, I extract just that
    // string. If you are actually working with a full document, you can
    // replace all the code below this comment with simply:
    // $result = $dom->saveHTML();

    // A string to hold the result
    $result = '';

    // Iterate all elements that are a direct child of the <body> and convert
    // them to strings
    foreach ($xpath->query('/html/body/*') as $node) {
        $result .= $node->C14N();
    }

    // $result now contains the modified HTML string

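See it working (note: the error messages you see are there because the HTML string you supplied is invalid)

Answer 2: (score: 1)

You shouldn't rely so heavily on reluctant quantifiers. They try to consume as few characters as possible, but they will consume as many as it takes to achieve an overall match. If the HTML is minified (in particular, if it has few or no line breaks), each of those .*? could end up trying to consume the rest of the document, and it may have to do so many times.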
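The same idea can be wrapped in a reusable function. The version below is a minimal sketch, not part of the answer's code: the function name `unlinkMatchingAnchors` is hypothetical, parse warnings are collected with libxml_use_internal_errors() rather than printed, and the result is built from every child node of `<body>` (text nodes included) with saveHTML(). Character-encoding handling is deliberately left out.

```php
<?php
/**
 * Hypothetical helper: replaces every <a> whose text matches $pattern with a
 * plain text node and returns the serialized children of <body>.
 */
function unlinkMatchingAnchors(string $html, string $pattern): string
{
    if (trim($html) === '') {
        return $html; // nothing to parse
    }

    // Collect libxml parse warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Copy the node list into an array so that replacing nodes cannot
    // disturb the iteration.
    foreach (iterator_to_array($xpath->query('//a')) as $anchor) {
        if (preg_match($pattern, $anchor->textContent)) {
            $text = $dom->createTextNode($anchor->textContent);
            $anchor->parentNode->replaceChild($text, $anchor);
        }
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html; // no <body> was created; give the input back unchanged
    }

    // Serialize each child of <body>, text nodes as well as elements.
    $result = '';
    foreach ($body->childNodes as $node) {
        $result .= $dom->saveHTML($node);
    }

    return $result;
}

echo unlinkMatchingAnchors(
    '<p>See <a href="/docpage.php?obscure==100" target="F.100">F.100</a> '
    . 'and <a href="http://google.com">Google!</a></p>',
    '/^\s*[gf]\s*\.\s*\d{1,4}\s*$/i'
), PHP_EOL;
```

Passing the pattern in as an argument keeps the identifier rule in one place, which matters here because the question says the same regex is used elsewhere in the application.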

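That's especially true when no match is possible; the regex has to try every possible path through the text before it admits defeat. Another problem is that reluctant quantifiers don't prevent a match from starting too early. Given this string: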

<a href="www.blah.com">...</a> <a href="www.foo.com">F.100</a>

...it will start matching at the first <a> tag and stop at the end of the second one. In this regex:

'~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i'

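...I've replaced each .*? with [^>]* or [^"]*, to confine those parts of the match to a single tag or a single attribute value, respectively. Although this regex works much better, be aware that it's not foolproof, far from it. But it's about as close as you can reasonably get when matching HTML with regex.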
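The difference is easy to observe on the two-link string above. In the sketch below, `$lazy` is an assumed reconstruction of the question's pattern with the identifier embedded (without the `[\s]`, since the sample identifiers contain no space), and `$strict` is the regex from this answer; the variable names are illustrative only.

```php
<?php
$input = '<a href="www.blah.com">...</a> <a href="www.foo.com">F.100</a>';

// Assumed reconstruction: the question's reluctant pattern with the
// identifier embedded as the second capture group.
$lazy = '~<a.*?href="(.*?)".*?>([FG]\.\d{1,4})</a>~i';

// The regex from this answer: each part is confined to one tag or one
// attribute value.
$strict = '~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i';

preg_match($lazy, $input, $lazyMatch);
preg_match($strict, $input, $strictMatch);

echo 'lazy matched:   ', $lazyMatch[0], PHP_EOL;
echo 'strict matched: ', $strictMatch[0], PHP_EOL;

// Replace each match with the identifier it captured.
$lazyResult   = preg_replace($lazy, '$2', $input);
$strictResult = preg_replace($strict, '$1', $input);

echo 'lazy result:   ', $lazyResult, PHP_EOL;
echo 'strict result: ', $strictResult, PHP_EOL;
```

The reluctant pattern matches the whole string, so the first link disappears along with its text; the stricter one matches only the second link and leaves the first intact.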