Question

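I have a string that contains several hyperlinks. I want to use a regular expression to match only certain links among them. I don't know whether href or class comes first; the order may vary. Here is a sample string: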

<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>

From the string above, I want to select only the link that has the class nextpostslink. So the match in this example should return this:

<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a>

This regex is the closest I have been able to get:

/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/

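But it selects everything starting from the first link in the string. I think my problem is the (.*), but I can't figure out how to change it so that only the desired link is selected.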

Thanks for your help.

Answer 1

It is far better to use a real HTML parser for this. Give up all attempts to use regular expressions on HTML.

Use PHP's DOMDocument instead:

$dom = new DOMDocument;
$dom->loadHTML($yourHTML);

foreach ($dom->getElementsByTagName('a') as $link) {
    $classes = explode(' ', $link->getAttribute('class'));

    if (in_array('nextpostslink', $classes)) {
        // $link has the class "nextpostslink"
    }
}
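
To make the loop above concrete, here is a minimal sketch that collects the URL, text, and markup of the matching link. It assumes a shortened, ASCII-only version of the sample markup (the link text "next" stands in for the original text), and the `$found` array is just an illustrative name.

```php
<?php
// Shortened, ASCII-only version of the sample markup (an assumption for this sketch).
$html = <<<'EOD'
<div class='wp-pagenavi'>
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>
<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>
<a class="cccc">xxx</a>
</div>
EOD;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);

$found = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // Split on any run of whitespace so multi-class attributes are handled.
    $classes = preg_split('/\s+/', trim($link->getAttribute('class')));

    if (in_array('nextpostslink', $classes, true)) {
        $found[] = [
            'url'  => $link->getAttribute('href'),
            'text' => $link->textContent,
            'html' => $dom->saveHTML($link), // serialized <a> element
        ];
    }
}

print_r($found);
```

Note that loadHTML does not assume UTF-8 input by default, so non-ASCII link text may need an explicit encoding declaration before it round-trips cleanly.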

Answer 2

This works in PHP:

/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m

This of course assumes that the class attribute always comes after the href attribute.

Here is a code snippet:

$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;

$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";

$matches = array();
if(preg_match($regexp, $html, $matches)) {
    echo "URL: " . $matches[2] . "\n";
    echo "Text: " . $matches[6] . "\n";
}

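But I would recommend matching the link first and then extracting the URL from it, so that the order of the attributes doesn't matter: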

<?php

$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;

$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";

$matches = array();
if(preg_match($regexp, $html, $matches)) {
    $link = $matches[0];
    $text = $matches[4];

    $regexp = "/href=(\"|')([^'\"]*)(\"|')/";
    $matches = array();
    // Search only inside the matched link, not the whole document.
    if(preg_match($regexp, $link, $matches)) {
        $url = $matches[2];

        echo "URL: $url\n";
        echo "Text: $text\n";
    }
}

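You could of course extend the regex to match either of the two variants (class first vs. href first), but it would get very long, and I don't think it would improve performance.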

As a proof of concept, I created a regex that doesn't care about the order:

/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m

The text will be in group 12, and the URL will be in either group 3 or group 10, depending on the order.
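
As a rough sketch of how the group numbers might be used in practice: the helper below (the name `findNextLink` is hypothetical) runs the order-independent regex and picks the URL from whichever of groups 3 and 10 took part in the match. The two test strings are shortened variants of the sample with ASCII link text.

```php
<?php
// Order-independent regex from above, unchanged.
$regexp = "/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m";

// Hypothetical helper: returns ['url' => ..., 'text' => ...] or null if nothing matches.
function findNextLink(string $html, string $regexp): ?array
{
    if (!preg_match($regexp, $html, $m)) {
        return null;
    }
    // Group 3 is filled when href comes first, group 10 when class comes first;
    // the group that did not take part in the match is an empty string.
    $url = $m[3] !== '' ? $m[3] : $m[10];

    return ['url' => $url, 'text' => $m[12]];
}

$hrefFirst  = "<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>"
            . '<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>';
$classFirst = '<a class="nextpostslink" href="http://stv.localhost/channel/political/page/2">next</a>';

print_r(findNextLink($hrefFirst, $regexp));
print_r(findNextLink($classFirst, $regexp));
```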

Answer 3

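No matter how hard you try, it is impossible to build an error-free HTML parser using only regular expressions (apart from trivial problems or ones with a very restricted input set: no nested tags, no single quotes inside double quotes, and so on).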

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Answer 4

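Not sure whether this is what you meant, but parsing HTML with regular expressions is a bad idea. Use an XPath implementation to get the elements you need. The following XPath expression gives you all 'a' elements with the class "nextpostslink":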

//a[contains(@class,"nextpostslink")]

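There is plenty of information about XPath available. Since you didn't mention your programming language, here is a quick XPath tutorial using Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html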

Edit

php + xpath + html: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
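
For a PHP reader, the XPath approach might look like the sketch below, which uses DOMDocument together with DOMXPath. One caveat: contains(@class, ...) is a substring test, so it would also match a class such as "nextpostslink-old". The stricter query pads the class list with spaces to match whole class names only. The sample is shortened, the "nextpostslink-old" link is hypothetical and added only to show the difference, and `hrefs` is an illustrative helper name.

```php
<?php
// Shortened sample; the "nextpostslink-old" link is hypothetical.
$html = <<<'EOD'
<div class='wp-pagenavi'>
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>
<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>
<a href='http://stv.localhost/channel/political/page/1' class='nextpostslink-old'>old</a>
</div>
EOD;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Illustrative helper: run a query and collect the href of every matching element.
function hrefs(DOMXPath $xpath, string $query): array
{
    $out = [];
    foreach ($xpath->query($query) as $a) {
        $out[] = $a->getAttribute('href');
    }
    return $out;
}

// Substring match, as in the expression above.
$loose = hrefs($xpath, '//a[contains(@class,"nextpostslink")]');

// Whole-class-name match.
$strict = hrefs($xpath, '//a[contains(concat(" ", normalize-space(@class), " "), " nextpostslink ")]');

print_r($loose);
print_r($strict);
```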

Answer 5

The question asks how to do this with a regex, so here is one way: <a\s[^>]*class=["|']nextpostslink["|'][^>]*>(.*)<\/a>

The order of the attributes doesn't matter, and it handles both single and double quotes.

Test the regex online: https://regex101.com/r/DX03KD/1/
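
One thing to watch for with this pattern: (.*) is greedy, so when several links share a line, as in the sample from the question, the capture can run on to the last </a> on that line. The sketch below compares the pattern as posted with a variant that uses a lazy (.*?) and a plain ["'] character class. Both changes are my assumptions and not part of the original answer, and the input line is a shortened version of the sample.

```php
<?php
// One line containing two links, shortened from the sample in the question.
$line = <<<'EOD'
<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>last</a>
EOD;

// Pattern as posted: greedy capture.
$greedy = '/<a\s[^>]*class=["|\']nextpostslink["|\'][^>]*>(.*)<\/a>/';

// Assumed variant: lazy capture, and no literal "|" inside the character class.
$lazy = '/<a\s[^>]*class=["\']nextpostslink["\'][^>]*>(.*?)<\/a>/';

preg_match($greedy, $line, $g);
preg_match($lazy, $line, $l);

echo "Greedy capture: {$g[1]}\n"; // runs past the first </a>
echo "Lazy capture:   {$l[1]}\n"; // stops at the first </a>
```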

Answer 6

I replaced (.*) with [^'"]+, like this:

<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>

Note: I tested this with RegexBuddy, so I didn't need to escape <, > or /
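
To see the effect of that change in PHP, here is a small sketch comparing the original pattern from the question with the modified one. It assumes ~ as the delimiter so that / needs no escaping, and uses a shortened two-link version of the sample with ASCII link text.

```php
<?php
// Two links on one line, shortened from the sample in the question.
$html = <<<'EOD'
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>
EOD;

// Pattern from the question: (.*) can run across quotes and tags.
$original = '~<a\s?(href=)?(\'|")(.*)(\'|") class=(\'|")nextpostslink(\'|")>.{1,6}</a>~';

// Modified pattern: [^'"]+ cannot cross a quote, so it stays inside one attribute value.
$modified = '~<a\s*(href=)?(\'|")[^\'"]+(\'|") class=(\'|")nextpostslink(\'|")>.{1,6}</a>~';

preg_match($original, $html, $o);
preg_match($modified, $html, $m);

echo "Original: {$o[0]}\n"; // starts at the first link
echo "Modified: {$m[0]}\n"; // only the nextpostslink link
```

This pattern still requires href to come directly before class, separated by a single space, so it does not cover the case where the attribute order varies.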

Regex to match only complete hyperlinks with a certain class
