Question

I have this pattern: /(\<iframe)(.*?)src="(.*?)(something)(.*?)"((\n|.)*?)(<\/iframe>)/ and this subject:

<p><iframe src="blah.something.blah">words<br />
<span>tags</span><br />
<span>tags</span><br />
<span itemprop="description" content=""></span><br />
<span itemprop="duration" content="1818"></span><br />
</iframe></p>

It works when I test it with JS on regexr.com, but it fails in PHP. If I remove the line breaks and change ((\n|.)*?) to (.*?), it works, but that is not good enough.

What am I doing wrong?

Answer 1

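As the comments say, "you should never parse HTML with regular expressions".

Use a parser: it is not too hard, and it gives you many more possibilities.

Using DOMDocument and DOMXPath

Consider these examples, applied to your HTML sample.

First, initialize a DOMDocument, load the HTML, and initialize a DOMXPath: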

$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );

To retrieve the src attribute of every <iframe>:

$iframes = $dom->getElementsByTagName( 'iframe' );
foreach( $iframes as $iframe )
{
    echo $iframe->getAttribute( 'src' ) . PHP_EOL;
}
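
The pattern in the question only matches iframes whose src contains "something". A minimal sketch of the same filter as an XPath predicate, assuming the sample markup from the question is held in `$html` (the `$matches` array is introduced here for illustration):

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<p><iframe src="blah.something.blah">words<br />
<span>tags</span><br />
<span>tags</span><br />
<span itemprop="description" content=""></span><br />
<span itemprop="duration" content="1818"></span><br />
</iframe></p>
HTML;

$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// contains() is an XPath 1.0 string function; it keeps only the
// iframes whose src attribute includes the given substring.
$matches = [];
foreach ($xpath->query('//iframe[contains(@src, "something")]') as $iframe) {
    $matches[] = $iframe->getAttribute('src');
}

print_r($matches);
```

Line breaks inside the iframe do not matter here, because the parser rather than the pattern decides where the element starts and ends.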

To retrieve "1818" from the span whose itemprop attribute is "duration":

$duration = $xpath->query( '//span[@itemprop="duration"]/@content' );
echo $duration->item(0)->nodeValue . PHP_EOL;

The XPath pattern above means:

//                      Select the following pattern wherever it occurs in the document
span                    with tag = 'span'
[@itemprop="duration"]  with attribute 'itemprop' = 'duration'
/@content               (get) attribute 'content'
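
Calling `item(0)` on an empty result returns null, so reading `nodeValue` directly fails when nothing matches. A minimal sketch of a guarded lookup follows; the helper `firstValue`, the sample `$html`, and the `rating` property are hypothetical names used only for illustration:

```php
<?php
// Hypothetical sample markup with a single matching span.
$html = '<div><span itemprop="duration" content="1818"></span></div>';

$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Returns the value of the first matching node, or null when the
// expression is invalid or matches nothing.
function firstValue(DOMXPath $xpath, string $expression): ?string
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

$duration = firstValue($xpath, '//span[@itemprop="duration"]/@content');
$missing  = firstValue($xpath, '//span[@itemprop="rating"]/@content');

var_dump($duration, $missing);
```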

Read more about DOMDocument
Read more about DOMXPath
Read more about XPath syntax
