How do I extract all anchor tags, their hrefs, and their anchor text from a string?

Time: 2014-05-07 20:04:42

Tags: php regex preg-replace preg-match domdocument

I need to process the links in an HTML string in several different ways.

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>.';
$links = extractLinks($str);
foreach ($links as $link) {
    $pattern = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    if (preg_match($pattern, $link['href'])) {
        // Process Remote links
        //   For example, replace url with short url,
        //   or replace long anchor text with truncated
    } else {
        // Process Local Links, Anchors

    }
}
function extractLinks($str) {
    // First, I tried DomDocument
    $dom = new DomDocument();
    $dom->loadHTML($str);
    return $dom->getElementsByTagName('a');
    // But this just returns:
    //   DOMNodeList Object
    //   (
    //       [length] => 3
    //   )

    // Then I tried Regex
    if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $str, $matches)) {
        print_r($matches);
    }
    // But this didn't work either.
}

Ideal result of extractLinks($str):

[0] => Array(
           'str' => '<a href="http://example.com/abc" rel="link">string</a>',
           'href' => 'http://example.com/abc',
           'anchorText' => 'string'
       ),
[1] => Array(
           'str' => '<a href="/local/path" title="with attributes">number</a>',
           'href' => '/local/path',
           'anchorText' => 'number'
       ),
[2] => Array(
           'str' => '<a href="#anchor" data-attr="lots">links</a>',
           'href' => '#anchor',
           'anchorText' => 'links'
       )

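I need all three pieces so that I can do things like edit the href (add tracking, shorten it, etc.) or replace the whole tag with something else (<a href="/u/username">username</a> might become username).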

Here is a demo of what I am trying to do.

2 answers:

Answer 0 (score: 12)

You just need to change it to:

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';

$dom = new DomDocument();
$dom->loadHTML($str);
$output = array();
foreach ($dom->getElementsByTagName('a') as $item) {
   $output[] = array (
      'str' => $dom->saveHTML($item),
      'href' => $item->getAttribute('href'),
      'anchorText' => $item->nodeValue
   );
}

By putting it in a loop and using getAttribute, nodeValue, and saveHTML(THE_NODE), you get the output you want.
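As a rough sketch, the loop above could be wrapped into the extractLinks() function from the question, and the result used to tell remote links from local ones. The isRemoteLink() helper and its list of schemes are assumptions made for illustration, not part of the answer; the libxml error handling is likewise an optional addition for HTML fragments that the parser might complain about.

```php
<?php
// Sketch: the DOMDocument loop wrapped in the question's extractLinks() function.
function extractLinks($str) {
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($str);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $item) {
        $links[] = array(
            'str'        => $dom->saveHTML($item),          // the whole tag
            'href'       => $item->getAttribute('href'),    // the URL
            'anchorText' => $item->nodeValue,               // the text inside the tag
        );
    }
    return $links;
}

// Hypothetical helper: a link counts as remote if its href has an
// http, https or ftp scheme (the same schemes the question's pattern lists).
function isRemoteLink($href) {
    $scheme = parse_url($href, PHP_URL_SCHEME);
    return in_array(strtolower((string) $scheme), array('http', 'https', 'ftp'), true);
}

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';

foreach (extractLinks($str) as $link) {
    echo (isRemoteLink($link['href']) ? 'remote' : 'local'), ': ', $link['href'], "\n";
}
```

Checking the scheme with parse_url() is one alternative to the long regular expression in the question; whether it covers every case you care about (protocol-relative URLs, for instance) is something to verify against your own data.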

Answer 1 (score: 4)

Like this:

<a\s*href="([^"]+)"[^>]+>([^<]+)</a>
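  1. The overall match is the array element 0 you want
  2. The group 1 capture is the array element 1 you want
  3. The group 2 capture is the array element 2 you want
  4. Use preg_match($pattern,$string,$m) for a single link, or preg_match_all to collect every link

    The array elements will be in $m[0], $m[1], and $m[2]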

    Working PHP demo here

    $string = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
            <a href="/local/path" title="with attributes">number</a> of
            <a href="#anchor" data-attr="lots">links</a>. ';
    $regex='|<a\s*href="([^"]+)"[^>]+>([^<]+)</a>|';
    $howmany = preg_match_all($regex,$string,$res,PREG_SET_ORDER);
    print_r($res);
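
With PREG_SET_ORDER, each entry of $res describes one link: index 0 is the full tag, index 1 the href, and index 2 the anchor text. The following is a minimal sketch of mapping that onto the array shape requested in the question, assuming the same pattern as above; the function name extractLinksWithRegex is made up for illustration.

```php
<?php
// Sketch: turn the PREG_SET_ORDER matches into the 'str'/'href'/'anchorText' shape.
function extractLinksWithRegex($string) {
    // Same pattern as in the answer above.
    $regex = '|<a\s*href="([^"]+)"[^>]+>([^<]+)</a>|';
    $links = array();
    if (preg_match_all($regex, $string, $res, PREG_SET_ORDER)) {
        foreach ($res as $m) {
            $links[] = array(
                'str'        => $m[0], // whole match
                'href'       => $m[1], // capture group 1
                'anchorText' => $m[2], // capture group 2
            );
        }
    }
    return $links;
}

$string = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>. ';
print_r(extractLinksWithRegex($string));
```

Note that this pattern expects href to be the first attribute and at least one more character before the closing >, so a bare <a href="x">text</a> would not match. For arbitrary HTML, a DOM parser is likely the more robust choice.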