Question

I have this data:

<meta name="description" content="Access Kenya is Kenya's leading corporate Internet service provider and is a technology solutions provider in Kenya with IT and network solutions for your business.Welcome to the Yellow Network.Kenya's leading Corporate and Residential ISP" />;

I am using this regular expression:

<meta +name *=[\"']?description[\"']? *content=[\"']?([^<>'\"]+)[\"']?

It retrieves the page description fine, except that the match stops wherever there is an apostrophe.

How can I get around this?

Answer 1

Your regular expression accounts for these three forms of the <meta> node:

<meta name="description" content="Some Content" />
<meta name='description' content='Some Content' />
<meta name=description content=Some Content />

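The third form is not valid HTML, but anything can happen, so... you are right to allow for it.

The simple fix is to change the end of your original regex so that it matches the closing tag, and to use the non-greedy ? operator: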

<meta +name *=[\"']?description[\"']? *content=[\"']?(.*?)[\"']? */?>
                                                      └─┘       └───┘
          search zero-or-more characters except following       closing tag characters

regex101 demo
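As a rough sketch of how the revised pattern might be used with preg_match: the sample string below is a shortened version of the data in the question, and the "~" delimiter and variable names are my own assumptions, not part of the original pattern.

```php
<?php
// Shortened sample from the question; note the apostrophe in "Kenya's".
$html = '<meta name="description" content="Access Kenya is Kenya\'s leading corporate Internet service provider" />';

// Revised pattern: the lazy (.*?) expands only until an optional quote,
// optional spaces, an optional "/" and the closing ">" can match.
// "~" is used as the delimiter so that "/" needs no escaping.
$pattern = '~<meta +name *=["\']?description["\']? *content=["\']?(.*?)["\']? */?>~';

$description = null;
if (preg_match($pattern, $html, $matches)) {
    $description = $matches[1];
}

echo $description, PHP_EOL;
// Access Kenya is Kenya's leading corporate Internet service provider
```

Keep in mind that a literal ">" inside the content value would still end the capture early, because the pattern relies on ">" to find the end of the tag.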

But, in that case, what happens if you have this element?

<meta content="Some Content" name="description" />

Your regex will fail.
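A quick way to see the failure for yourself, assuming the same revised pattern with "~" as the delimiter (the variable names are only illustrative):

```php
<?php
// Same revised pattern as above; it expects "name" to come before "content".
$pattern = '~<meta +name *=["\']?description["\']? *content=["\']?(.*?)["\']? */?>~';

// Attributes in the opposite order, which is equally valid HTML.
$reordered = '<meta content="Some Content" name="description" />';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
$found = preg_match($pattern, $reordered, $matches);

var_dump($found); // int(0)
```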

To match HTML nodes reliably, you have to use a parser:

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($yourHtmlString);
$xpath = new DOMXPath($dom);
$description = $xpath->query('//meta[@name="description"]/@content');
echo $description->item(0)->nodeValue;

This will output:

Some Content

Yes, it is 5 lines versus 1, but with this method you will match any <meta name="description"> (even if it contains a third, invalid attribute).
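If you want something reusable, the parser approach could be wrapped roughly like this. getMetaDescription is a hypothetical helper name of my own, and the sketch assumes a non-empty HTML string and the standard DOM extension; it returns null when no description tag is found instead of calling item(0) on an empty list.

```php
<?php
// Hypothetical helper (the name is illustrative, not from any library).
function getMetaDescription(string $html): ?string
{
    $dom = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them,
    // then restore the previous error-handling setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//meta[@name="description"]/@content');

    // query() returns a DOMNodeList (or false on a bad expression);
    // the list is empty when the tag is missing.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return $nodes->item(0)->nodeValue;
}

// Reversed attribute order and an apostrophe in the value.
$html = '<html><head><meta content="Kenya\'s leading ISP" name="description" /></head><body></body></html>';

echo getMetaDescription($html), PHP_EOL; // Kenya's leading ISP
```

Neither the attribute order nor the apostrophe matters here, because the parser deals with quoting and attributes for you.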

Read more about DOMDocument

Read more about DOMXPath

Read why you can't parse [X]HTML with regular expressions
