Extract and remove disallowed anchor tags

Time: 2012-07-07 20:38:19

Tags: php regex

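I am trying to write a script that does the following:

  1. Read content from a file or a database
  2. Extract all anchor tags from the content
  3. Scan all the links and keep the allowed ones, such as links to social networks, search engines, or authority domains, and remove the rest while preserving their content (the anchor text).
  4. Sample content: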

    <p><a rel="nofollow" href="http://www.test.com/tyest">test1</a></p>
    <p><a href="http://google.com">google</a></p>
    <p><a title="This is just a check" href="http://www.check.com">check</a></p>
    <p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>

    Allowed domains:

    google.com
    msn.com
    ip.com

    Expected output:

    <p>test1</p>
    <p><a href="http://google.com">google</a></p>
    <p>check</p>
    <p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>

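    Restrictions:

    1. Anchor tags do not follow any particular rule; they may contain rel, title, and description attributes, in any order.
    2. The anchor text itself can also be a link, e.g. http://google.com, and it should be kept even when the link is not allowed.
    3. I did my homework and tried to write a simple bare-bones script to get started, using various regular expressions and help available online, but without success. Here is my code: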

      // sample input
      $comment = '<p><a rel="nofollow" href="http://www.1google.com/tyest">test with no http</a></p>
                      <p><a rel="nofollow" href="http://google.com">just a domain name</a></p>
                      <p><a rel="nofollow" href="http://www.g1gle.com">check</a></p>
                      <p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>
                      <p><a rel="nofollow" href="http://osamashabrez.com">http://testx.osamashabrez.com</a></p>
                      <p><a rel="nofollow" href="http://www.subchalega.com">http://www.subchalega.com</a></p>
                      <p><a rel="nofollow" href="http://www.letscheck.com">http://www.letscheck.com</a></p>
                      <p><a rel="nofollow" href="http://www.google.com/osama/here/">http://www.google.com</a></p>
                      <p><a rel="nofollow" description="testing" title="google" href="http://www.google.com/last/">laaaaaaaa</a></p><h1>Header one</h1>
                      <p><a rel="nofollow" href="http://domain1.com">http://testx.osamashabrez.com</a></p>';
      
      // add http to the domain name if not already present
      function addhttp($url) {
          if (!preg_match('~^(?:f|ht)tps?://~i', $url)) {
              $url = 'http://' . $url;
          }
          return $url;
      }
      
      // remove deep links so the URL can be matched against the allowed URLs
      function removeDeepLinks($url) {
          $pos = strrpos ( $url, '.com' );
          if ( $pos !== false )
              return substr( $url, 6, $pos-2 );
          return $url;
      }
      // allowed domains fetched from the db
      $domains = "http://osamashabrez.com\rhttp://google.com\rwordpress.org\rabc.com";
      $domains = preg_split( "~\r~", $domains, -1, PREG_SPLIT_NO_EMPTY );
      // adding http if not already present
      // this will be done once, when the data is inserted
      foreach ( $domains as $key => $domain ) { $domains[$key] = addhttp($domain); }
      // remove this and sky will fall on your head :D
      sort( $domains );
      print_r ( $domains );
      // regex to extract the href="xyz.com" link, as we cannot use any other option
      // due to the uncertainty of the data passed to this script
      $regex = '/(href=".*?")/is';
      if ($c=preg_match_all ($regex, $comment, $matches)) {
          $matches = $matches[1];
          foreach ( $matches as $key => $url ) {
              // remove deep links for matching
              $matches[$key] = removeDeepLinks($url);
          }
          print_r($matches);
          foreach( $matches as $key => $url ) {
              // if domain is not allowed
              if ( !array_search( $url, $domains ) ) {
                  // find position of URL
                  $pos_url     = strrpos( $comment, $url );
                  // find the starting position of the anchor tag
                  $pos_a_start = strrpos(substr($comment, 0, $pos_url), '<a ');
                  // find the end
                  $pos_a_end   = strpos($comment, '</a>',$pos_url);
                  // extract the whole anchor tag
                  $anchor_tag  = substr($comment, $pos_a_start, $pos_a_end - $pos_a_start + 4);
                  echo "URL:\t" .$url . "\r";
                  echo "Anchor Tag:\t{$anchor_tag}\r";
                  echo "POS START :: END:\t{$pos_a_start}::{$pos_a_end}\r";
      
      
                  // something weird happens here: with this line commented out the script runs,
                  // but only the opening tags are removed from the text
                  // the code works with some data inputs and fails with a few others
                  $comment = substr($comment, 0, $pos_a_end) . substr($comment, $pos_a_end+4);
                  // removing opening tags
                  $opening_tag = substr( $anchor_tag, 0, strpos($anchor_tag, '>') +1 );
                  $comment = str_replace($opening_tag, '', $comment);
              }
          }
      }
      echo $comment;
      

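      The code above works with some data inputs and breaks with others. I would like some help: a working code example or a review of the code I provided. Please also mention whether there is a better way to get the job done. Any help will be highly appreciated.

      Thanks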

1 answer:

Answer 0: (score: 1)

A DOM parser is better suited to this task.

There are many options available.

Here is an example using QueryPath:

// Assumes $html holds the content and $allowedHosts is an array of allowed host names
$qp = qp($html);
foreach ($qp->find('a') as $link) {
    $href = $link->attr('href');
    // Get the host domain
    $host = parse_url($href, PHP_URL_HOST);
    // Check our allowed hosts
    if (!in_array($host, $allowedHosts)) {
        // Replace the link's HTML with just its text
        $link->html($link->text());
    }
}
// Echo our result
echo $qp->top()->html();

(Untested, but it should work with some modifications.)
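
The same idea can also be sketched with PHP's built-in DOM extension (DOMDocument), which avoids an extra dependency. The function names isAllowedHost and stripDisallowedLinks are hypothetical, and the sketch assumes that a subdomain such as www.ip.com counts as allowed when ip.com is on the list, which is what the expected output in the question implies. Treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper: true if $host is an allowed domain or a subdomain of one.
// Assumption: "www.ip.com" should match the allowed entry "ip.com".
function isAllowedHost($host, array $allowedDomains): bool
{
    // parse_url() yields null or false when the URL has no usable host
    if (!is_string($host) || $host === '') {
        return false;
    }
    $host = strtolower($host);
    foreach ($allowedDomains as $domain) {
        $domain = strtolower($domain);
        $suffix = '.' . $domain;
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: unwrap every <a> whose host is not allowed,
// keeping the anchor's child nodes (its text) in place.
function stripDisallowedLinks(string $html, array $allowedDomains): string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a full document and declare UTF-8 explicitly
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Snapshot the live node list, because the loop modifies the tree
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $link) {
        $host = parse_url($link->getAttribute('href'), PHP_URL_HOST);
        if (isAllowedHost($host, $allowedDomains)) {
            continue;
        }
        // Move the children in front of the link, then drop the empty <a>
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialize only the children of <body>, i.e. the original fragment
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage with the sample content and allowed domains from the question
$sample = '<p><a rel="nofollow" href="http://www.test.com/tyest">test1</a></p>' . "\n"
    . '<p><a href="http://google.com">google</a></p>' . "\n"
    . '<p><a title="This is just a check" href="http://www.check.com">check</a></p>' . "\n"
    . '<p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>';
echo stripDisallowedLinks($sample, ['google.com', 'msn.com', 'ip.com']);
```

Moving the child nodes out before removing the `<a>` element keeps any nested markup inside the anchor text intact, and comparing hosts obtained from parse_url avoids the fragile string offsets used in the regex attempt.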