I have the following meta tags in a PHP variable:
<?php
$data="<meta charset='UTF-8'>
<meta name='keywords' content='your, tags'>
<meta name='description' content='150 words'>
<meta name='subject' content='your website's subject'>
<meta name='copyright' content='company name'>
<meta name='language' content='ES'>
<meta name='robots' content='index,follow'>
<meta name='revised' content='Sunday, July 18th, 2010, 5:15 pm'>
<meta name='abstract' content=''>
<meta name='topic' content=''>
<meta name='summary' content=''>
<meta name='Classification' content='Business'>
<meta name='author' content='name, email@hotmail.com'>
<meta name='designer' content=''>
<meta name='reply-to' content='email@hotmail.com'>
<meta name='owner' content=''>
<meta name='url' content='http://www.websiteaddrress.com'>
<meta name='identifier-URL' content='http://www.websiteaddress.com'>
<meta name='directory' content='submission'>
<meta name='pagename' content='jQuery Tools, Tutorials and Resources - O'Reilly Media'>
<meta name='category' content=''>
<meta name='coverage' content='Worldwide'>
<meta name='distribution' content='Global'>
<meta name='rating' content='General'>
<meta name='revisit-after' content='7 days'>
<meta name='subtitle' content='This is my subtitle'>
<meta name='target' content='all'>
<meta name='HandheldFriendly' content='True'>
<meta name='MobileOptimized' content='320'>
<meta name='date' content='Sep. 27, 2010'>
<meta name='search_date' content='2010-09-27'>
<meta name='DC.title' content='Unstoppable Robot Ninja'>
<meta name='ResourceLoaderDynamicStyles' content=''>
<meta name='medium' content='blog'>
<meta name='syndication-source' content='https://mashable.com/2008/12/24/free-brand-monitoring-tools/'>
<meta name='original-source' content='https://mashable.com/2008/12/24/free-brand-monitoring-tools/'>
<meta name='verify-v1' content='dV1r/ZJJdDEI++fKJ6iDEl6o+TMNtSu0kv18ONeqM0I='>
<meta name='y_key' content='1e39c508e0d87750'>
<meta name='pageKey' content='guest-home'>
<meta itemprop='name' content='jQTouch'>
<meta http-equiv='Expires' content='0'>
<meta http-equiv='Pragma' content='no-cache'>
<meta http-equiv='Cache-Control' content='no-cache'>
<meta http-equiv='imagetoolbar' content='no'>
<meta http-equiv='x-dns-prefetch-control' content='off'>";
I want to extract the values of the listed meta tags, including both the name meta tags and the http-equiv meta tags.
This is my attempt:
// explode the string by newline
$parts=explode("\n",$data);
// loop through each meta tag line
foreach($parts as $part){
// match inside the name attribute and the content attribute
preg_match("/<meta name=\"(.*)\" content=\"(.*)\" \/>/i",$part,$matches);
// returns "Array()"
print "<pre>".print_r($matches,true)."</pre>";
}
I think there is a problem with my regular expression.
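Answer 0: (score: 0)
Your attributes are quoted with single quotes, not double quotes, and the tag ends with > rather than />, with no space before it: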
preg_match("/<meta name='([^']*)' content='([^']*)'\s?\/?>/i", $part, $matches);
Explanation:
[^']* # get all data until ' is reached
\s? # with whitespace character (\s), or not (?)
\/? # with slash (/) or not (?)
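To see the pattern at work inside the question's loop, here is a minimal sketch. The three-line sample string and the `$found` array are assumptions made for illustration. Note that the pattern as written only targets `name=` tags, so the `http-equiv` line is skipped.

```php
<?php
// Hypothetical sample input: three lines in the question's format.
$data = "<meta name='robots' content='index,follow'>
<meta name='coverage' content='Worldwide'>
<meta http-equiv='Pragma' content='no-cache'>";

// The pattern from this answer, unchanged.
$pattern = "/<meta name='([^']*)' content='([^']*)'\s?\/?>/i";

$found = [];
foreach (explode("\n", $data) as $line) {
    // preg_match() returns 1 when the line matches, 0 otherwise.
    if (preg_match($pattern, $line, $m)) {
        // $m[1] is the name attribute, $m[2] the content attribute.
        $found[$m[1]] = $m[2];
    }
}

print_r($found);
```

Checking the return value of `preg_match()` before reading `$m` avoids collecting empty arrays for lines that do not match.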
Here is a version that also accepts double quotes and multiple spaces:
"/<meta\s*name=['\"]([^']*)['\"]\s*content=['\"]([^']*)['\"]\s?\/?>/i"
However, it is always better to inspect HTML elements with a DOM parser.
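As a rough sketch of the DOM approach, assuming the dom extension is available and the attribute values are well formed: the sample string and the `$found` array below are illustrative, not taken from the question.

```php
<?php
// Hypothetical sample input with well-formed attribute values.
$data = "<meta name='robots' content='index,follow'>
<meta name='coverage' content='Worldwide'>
<meta http-equiv='Pragma' content='no-cache'>";

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($data);
libxml_clear_errors();

$found = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    // Prefer the name attribute and fall back to http-equiv.
    $key = $meta->hasAttribute('name')
        ? $meta->getAttribute('name')
        : $meta->getAttribute('http-equiv');
    if ($key !== '') {
        $found[$key] = $meta->getAttribute('content');
    }
}

print_r($found);
```

The parser handles quoting style and attribute order on its own, which is what makes it more robust than a hand-written pattern on well-formed markup.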
Answer 1: (score: 0)
Under normal circumstances, the best and most reliable advice is to parse your HTML with DomDocument or another dedicated HTML parsing tool.
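A solution implementing DomDocument and XPath is the natural first choice.
However, because your input data is malformed (unescaped single quotes inside content attribute values), those values get truncated at the first single quote.
The most direct workaround that captures the targeted data without truncating those values is a greedy regex pattern.
Code:
preg_match_all("~<meta (?:name|http-equiv)='(.*)' content='(.*)'>~", $data, $matches, PREG_SET_ORDER);
This pattern works because your input data has a seemingly strict format: each targeted line has the 2 targeted attributes, in the same order, with no extra characters to account for. The greedy * quantifier matches zero or more characters (as many as possible, apostrophes included) while still obeying the rest of the pattern, so your attribute values are not truncated. I am using PREG_SET_ORDER to group each meta tag's data together; you don't have to use it in your actual project. The linked demo shows the regex method and a commented-out DomDocument method that demonstrates the quoting issue.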
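To make the greedy behaviour concrete, here is a small sketch. The three-line sample and the `$found` array are assumptions for illustration; the second line contains an unescaped apostrophe like the ones in the question.

```php
<?php
// Hypothetical sample input; the second value contains an unescaped apostrophe.
$data = "<meta name='robots' content='index,follow'>
<meta name='subject' content='your website's subject'>
<meta http-equiv='Pragma' content='no-cache'>";

// Greedy (.*) runs to the end of the line and backtracks to the last '> it can reach.
// The dot does not match newlines by default, so each match stays on one line.
preg_match_all(
    "~<meta (?:name|http-equiv)='(.*)' content='(.*)'>~",
    $data,
    $matches,
    PREG_SET_ORDER
);

$found = [];
foreach ($matches as $set) {
    // $set[0] is the whole tag, $set[1] the name or http-equiv, $set[2] the content.
    $found[$set[1]] = $set[2];
}

print_r($found);
```

With `PREG_SET_ORDER`, each element of `$matches` holds one tag's captures together, which keeps the loop simple. This relies on one tag per line; two tags on the same line would be swallowed into a single greedy match.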