preg_match pattern

Time: 2016-05-21 14:56:11

Tags: php regex

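I want to create a suitable preg_match pattern to extract every <link *rel="stylesheet"* /> inside the <head> of certain web pages. The pattern #<link (.+?)>#is worked fine until I realized it also captures the <link rel="shortcut icon" href="favicon.ico" /> in the <head>. So I want to change the pattern to make sure the word stylesheet appears somewhere inside the link tag. I think it needs some kind of lookaround, but I don't know how to do it. Any help would be appreciated.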
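
For reference, here is a minimal reproduction of the problem. The markup and file names are made-up placeholders, not taken from a real page; only the pattern is the one described above.

```php
<?php
// Hypothetical <head> fragment; the file names are placeholders.
$head = <<<'HTML'
<head>
<link rel="stylesheet" href="style.css" />
<link rel="shortcut icon" href="favicon.ico" />
</head>
HTML;

// The pattern from the question: it accepts any <link ...> tag.
preg_match_all('#<link (.+?)>#is', $head, $matches);

// $matches[0] holds the full tags. Both links are captured,
// including the unwanted favicon link.
print_r($matches[0]);
```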

2 answers:

Answer 0 (score: 2)

Here we go again... don't use a regex to parse html, use an html parser instead, such as PHP's DOMDocument.
Here is an example of how to use it:

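$html = file_get_contents("https://stackoverflow.com");
$dom = new DOMDocument();
// Collect libxml parse warnings internally; real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every <link> element whose rel attribute is exactly "stylesheet".
foreach ($xpath->query("//link[@rel='stylesheet']") as $link)
{
    // Print one href per line.
    echo $link->getAttribute("href"), PHP_EOL;
}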

PHPFiddle Demo
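
If only links inside <head> are wanted, and rel may carry several space-separated tokens, the XPath query can be tightened. The following is a sketch under those assumptions, not part of the original answer: the function name headStylesheetHrefs and the sample markup are hypothetical, and treating rel="alternate stylesheet" as a stylesheet link is a choice you may not want.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = <<<'HTML'
<html>
<head>
<link rel="stylesheet" href="main.css" />
<link rel="alternate stylesheet" href="dark.css" />
<link rel="shortcut icon" href="favicon.ico" />
</head>
<body><p>content</p></body>
</html>
HTML;

// Hypothetical helper: returns the href of every stylesheet link in <head>.
function headStylesheetHrefs(string $html): array
{
    $dom = new DOMDocument();
    // Silence parse warnings while loading, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Pad rel with spaces so "stylesheet" is matched as a whole token.
    $query = "//head/link[contains(concat(' ', normalize-space(@rel), ' '), ' stylesheet ')]";

    $hrefs = [];
    foreach ($xpath->query($query) as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

print_r(headStylesheetHrefs($html)); // main.css and dark.css
```

Note that XPath string comparison is case-sensitive, so rel="Stylesheet" would be missed by this query.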

Answer 1 (score: 0)

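To do this with a regular expression, it is best to treat it as a two-part operation. The first part separates the head from the body, so that you are only working inside the head.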

The second part then parses the head, looking for the desired links.
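
The first part is not spelled out in this answer. One way it might look is sketched below; the helper name extractHead and its pattern are assumptions, and it will not cope with every malformed page.

```php
<?php
// Hypothetical helper: returns the contents of <head>, or null if none is found.
function extractHead(string $html): ?string
{
    // \b keeps <header> from matching; [^>]* allows attributes on <head>.
    // The s flag lets . span newlines, the i flag ignores tag case.
    if (preg_match('~<head\b[^>]*>(.*?)</head\s*>~is', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Made-up page with one link in the head and one in the body.
$page = '<html><head><link rel="stylesheet" href="a.css"></head>'
      . '<body><link rel="stylesheet" href="b.css"></body></html>';

echo extractHead($page), "\n"; // <link rel="stylesheet" href="a.css">
```

The link-parsing expression can then be run against the returned string only.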

Parsing the links

<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>

Regular expression visualization

This expression will do the following:

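  • Finds all <link tags
  • Ensures the link tag has the desired attribute rel='stylesheet'
  • Allows attribute values to be single-quoted, double-quoted, or unquoted
  • Avoids the messy and difficult edge cases that make the HTML Parse Police cry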

Example

Live demo

https://regex101.com/r/hC5dD0/1

Sample text

Note the difficult edge case in the last line.

<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">

Sample match

<link *rel="stylesheet"* />
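
To try the same thing from PHP, the expression can be handed to preg_match_all. This is only a sketch: the ~ delimiters and the i flag are my assumptions (the expression is given without either), and a nowdoc is used so the quotes and backslashes need no extra escaping.

```php
<?php
// The expression from this answer, wrapped in ~ delimiters with the i flag.
$pattern = <<<'REGEX'
~<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>~i
REGEX;

// The sample text from above, including the edge case on the last line.
$sample = <<<'HTML'
<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">
HTML;

preg_match_all($pattern, $sample, $matches);

// Only the first line is matched: <link *rel="stylesheet"* />
print_r($matches[0]);
```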

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <link                    '<link'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    rel=                     'rel='
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    stylesheet               'stylesheet'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^']                     any character except: '''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
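
The breakdown above can also be kept next to the pattern itself by using PCRE's extended mode (the x flag), which ignores unescaped whitespace outside character classes and treats # as the start of a comment. The layout, the ~ delimiters, the i flag and the sample markup below are my own assumptions; the expression is meant to be the same one.

```php
<?php
$pattern = <<<'REGEX'
~
<link \s*
(?=                          # look ahead, without consuming input, for:
  (?: [^>=]                  #   a character that is neither > nor =
    | =' [^']* '             #   or a single-quoted attribute value
    | =" [^"]* "             #   or a double-quoted attribute value
    | = [^'"] [^\s>]*        #   or an unquoted attribute value
  )*?
  rel= ['"]? stylesheet      #   then rel=stylesheet, quoted or not
)
(?: [^>=]                    # consume the rest of the tag the same way
  | =' (?: [^'] | \\' )* '
  | =" (?: [^"] | \\" )* "
  | = [^'"] [^\s>]*
)*
\s* >
~xi
REGEX;

// Made-up sample covering unquoted and single-quoted attribute values.
$sample = <<<'HTML'
<link rel=stylesheet href=a.css>
<link rel='stylesheet' href='b.css' />
<link rel="icon" href="favicon.ico">
HTML;

preg_match_all($pattern, $sample, $matches);

// The first two tags are matched; the icon link is not.
print_r($matches[0]);
```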