Question

I want to extract all URLs and titles from a piece of text.

Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.

With the following regular expression I can get all the hrefs, but I don't know how to also get the title between the <a></a> tags?

preg_match_all('/<a.*href="?([^" ]*)" /iU', $v['message'], $urls);

Ideally I would get an associative array like this:

[0] => Array
(
   [title] => XXX
   [link] => http://test.com/blop
)
[1] => Array
(
   [title] => XXX
   [link] => http://test.com
)

Thanks for your help.

Answer 1

If you still insist on using a regular expression for this, you may be able to parse some of it with this pattern:

<a.*?href="(.*?)".*?>(.*?)</a>

Note that it does not use the U modifier the way yours does; the lazy quantifiers (.*?) do that job instead.

Update: to make it accept single quotes as well as double quotes, you can use the following pattern:

<a.*?href=(?:"(.*?)"|'(.*?)').*?>(.*?)</a>
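
A minimal sketch of how the pattern above could be turned into the associative array the question asks for. The `~` delimiter, the `is` modifiers, the `$message` and `$links` variable names, and the single-quoted second href in the sample are assumptions made for illustration; they are not part of the original answer. Because the pattern has one capture group per quote style, the link ends up in group 1 or group 2 and the title in group 3.

```php
<?php
// Hypothetical input; the question reads it from $v['message'].
// The second link uses single quotes to exercise the alternation.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> '
         . 'sur les remakes et suites souhaités sont '
         . '<a href=\'http://test.com\' class="c_link-blue">dans le blog</a>.';

// "~" is used as the delimiter so the slash in </a> needs no escaping.
$pattern = '~<a.*?href=(?:"(.*?)"|\'(.*?)\').*?>(.*?)</a>~is';

$links = [];
if (preg_match_all($pattern, $message, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $links[] = [
            'title' => $m[3],
            // Group 1 holds a double-quoted href, group 2 a single-quoted one;
            // the group that did not take part in the match is an empty string.
            'link'  => $m[1] !== '' ? $m[1] : $m[2],
        ];
    }
}

print_r($links);
```

This is still a regex working on HTML, so it only copes with simple, single-level markup like the sample.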

Answer 2

As mentioned in the comments, don't use a regular expression; use a DOM parser instead, e.g.:

<?php
$doc = new DOMDocument;
$doc->loadHTML( getExampleData() );

$xpath = new DOMXPath($doc);
foreach( $xpath->query('/html/body/p[@id="abc"]//a') as $node ) {
    echo $node->getAttribute('href'), ' - ' , $node->textContent, "\n";
}

function getExampleData() {
    return '<html><head><title>...</title></head><body>
    <p>
        not <a href="wrong">this one</a> but ....
    </p>
    <p id="abc">
        Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.
    </p>
    </body></html>';
}

See http://docs.php.net/DOMDocument and http://docs.php.net/DOMXPath

Answer 3

You shouldn't use a regex for this. You should use an XML/DOM parser. I quickly put this together using DOMDocument.

$links = array();
$dom = new DOMDocument;
@$dom->loadHTML('Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.');
$xPath = new DOMXPath($dom);
$a = $xPath->query('//a');
for($i=0; $i<$a->length; $i++){
    $e = $a->item($i);
    $links[] = array(
        'title' => $e->nodeValue,
        'link' => $e->getAttribute('href')
    );
}
print_r($links);

DEMO: http://codepad.org/2LEn2CAJ

Answer 4

preg_match_all("/<a[^>]*href=\"([^\"]*)[^>]*>([^<]*)<\/a>/", $v['message'], $urls, PREG_SET_ORDER);

That should do what you need. It isn't an associative array, but it should be a nested array in the format you want.
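
If the associative keys from the question are needed, the PREG_SET_ORDER result can be remapped afterwards. The following is only a sketch: `$message` stands in for `$v['message']`, and `$links` is a name chosen here. Note that the slash in the closing tag is escaped, since `/` is also the pattern delimiter.

```php
<?php
// Hypothetical input; the question reads it from $v['message'].
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> '
         . 'sur les remakes et suites souhaités sont '
         . '<a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// With PREG_SET_ORDER, each element of $urls is one match:
// [0] => whole tag, [1] => href value, [2] => text between the tags.
preg_match_all('/<a[^>]*href="([^"]*)[^>]*>([^<]*)<\/a>/', $message, $urls, PREG_SET_ORDER);

// Rename the numeric groups to the keys requested in the question.
$links = array_map(
    function (array $m) {
        return ['title' => $m[2], 'link' => $m[1]];
    },
    $urls
);

print_r($links);
```

Because the title group is `[^<]*`, a link whose text contains nested tags (for example `<b>`) would not match at all.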

Answer 5

To those suggesting the DOM: using the DOM may well be fine. But surely you wouldn't use a FULL DOM parser just to parse a few URLs/titles!

Just use a regular expression:

/<a.*href="([^" ]*)".*>(.*)<\/a>/iU

This regex finds all the URLs and titles.
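
The U modifier matters here because it makes every quantifier lazy by default. A small sketch comparing the pattern with and without it on the sample text from the question; the variable names are placeholders chosen for this illustration.

```php
<?php
// Hypothetical input; the question reads it from $v['message'].
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> '
         . 'sur les remakes et suites souhaités sont '
         . '<a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// With U: quantifiers are lazy, so each <a>...</a> is matched on its own.
$lazyCount = preg_match_all('/<a.*href="([^" ]*)".*>(.*)<\/a>/iU', $message, $lazy, PREG_SET_ORDER);

// Without U: the greedy .* runs from the first <a to the last </a>,
// so both links collapse into a single match.
$greedyCount = preg_match_all('/<a.*href="([^" ]*)".*>(.*)<\/a>/i', $message, $greedy, PREG_SET_ORDER);

echo "with U: $lazyCount match(es), without U: $greedyCount match(es)\n";

// Build the requested structure from the lazy result.
$links = [];
foreach ($lazy as $m) {
    $links[] = ['title' => $m[2], 'link' => $m[1]];
}
print_r($links);
```

Since `.` does not match newlines without the `s` modifier, this only works when each link sits on a single line.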
