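I'd like to create a proper preg_match pattern to extract all <link *rel="stylesheet"* /> tags within the <head> of some web pages. This pattern: #<link (.+?)>#is
worked fine until I realized it also captures <link rel="shortcut icon" href="favicon.ico" /> in the <head>. So I want to change the pattern so that it makes sure the word stylesheet appears somewhere inside the link. I think it needs some kind of lookaround, but I don't know how to write it. Any help would be much appreciated.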
Answer 0 (score: 2)
Here we go again... don't use a regex to parse HTML; use an HTML parser such as PHP's DOMDocument.
Here is an example of how to use it:
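$html = file_get_contents("https://stackoverflow.com");
$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them for malformed real-world HTML
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every <link> whose rel attribute is exactly "stylesheet"
foreach ($xpath->query("//link[@rel='stylesheet']") as $link)
{
    echo $link->getAttribute("href"), PHP_EOL;
}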
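The XPath test above compares rel with the exact string stylesheet, so a value such as "alternate stylesheet" or "StyleSheet" would be skipped. A minimal sketch of a more tolerant filter follows, assuming that rel should be treated as a case-insensitive, space-separated token list; the sample markup and the $hrefs variable are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = <<<'HTML'
<html><head>
<link rel="stylesheet" href="main.css">
<link rel="alternate stylesheet" href="alt.css">
<link rel="StyleSheet" href="legacy.css">
<link rel="shortcut icon" href="favicon.ico">
</head><body></body></html>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//head/link[@rel]') as $link) {
    // Split rel into lower-cased tokens and look for "stylesheet" among them.
    $tokens = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
    if (in_array('stylesheet', $tokens, true)) {
        $hrefs[] = $link->getAttribute('href');
    }
}

print_r($hrefs);
```

Doing the token check in PHP rather than in XPath 1.0 keeps the query short; the favicon link is dropped because none of its rel tokens is stylesheet.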
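Answer 1 (score: 0)
To do this with a regular expression, it is best to treat it as a two-part operation. The first part separates the head from the body, so that you are only working inside the head.
The second part then parses the head, looking for the desired links: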
<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>
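This expression will do the following:
- find <link tags
- require that the tag has the attribute rel='stylesheet, whether the value is single-quoted, double-quoted, or unquoted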
Live demo
https://regex101.com/r/hC5dD0/1
Sample text
Note the difficult edge case in the last line.
<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">
Sample matches
<link *rel="stylesheet"* />
NODE                     EXPLANATION
----------------------------------------------------------------------
  <link                    '<link'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    rel=                     'rel='
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    stylesheet               'stylesheet'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^']                     any character except: '''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
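The two-part approach described in this answer could be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: the page in $html is made up around the sample text, the head-isolating pattern in step 1 is one simple guess at how to split head from body, and the ~ delimiters and the i flag are additions that are not part of the expression as posted.

```php
<?php
// Hypothetical page; the three <link> lines are the sample text from above.
$html = <<<'HTML'
<html><head>
<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">
</head>
<body><p>content</p></body></html>
HTML;

// Step 1: isolate the head so that later matching never touches the body.
// (Assumed helper pattern, not taken from the answer.)
$head = preg_match('~<head\b[^>]*>(.*?)</head>~is', $html, $m) ? $m[1] : '';

// Step 2: apply the expression from the answer to the head only.
// A nowdoc keeps the quotes and backslashes exactly as written.
$pattern = <<<'REGEX'
~<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>~i
REGEX;

preg_match_all($pattern, $head, $matches);

// $matches[0] holds the full text of every matching <link> tag.
print_r($matches[0]);
```

With this input only the first link should be reported: the lookahead steps over each quoted attribute value as a single unit, so the rel="stylesheet" text hidden inside the onmouseover value in the last line is never mistaken for a real attribute.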