Question

I am trying to use preg_match() to extract text that is not enclosed in tags such as <p> or <img>. The text is retrieved from a database, and I am using PHP.

This should be extracted <p>I do not want this</p> This should be extracted <a>This may appear after other tags and I do not want this</a>

I tried (.*)(<p>|<a>|<\/p>|<\/a>)(.*), but this captures everything up to the last tag, so the earlier tags are captured along with the text outside the tags.
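To illustrate the greedy behavior described above, here is a minimal sketch that runs the same pattern against the sample string. The variable names are only illustrative, and the comments describe what the greedy `(.*)` is expected to do under PCRE's usual backtracking rules.

```php
<?php
// Sample input from the question.
$html = 'This should be extracted <p>I do not want this</p> This should be extracted <a>This may appear after other tags and I do not want this</a>';

// The pattern from the question; the first (.*) is greedy.
$pattern = '/(.*)(<p>|<a>|<\/p>|<\/a>)(.*)/';

if (preg_match($pattern, $html, $m)) {
    // Group 2 is the LAST tag in the string, because the greedy (.*)
    // only gives back as many characters as it must.
    var_dump($m[2]);
    // Group 1 therefore still contains the earlier tags and their contents.
    var_dump(strpos($m[1], '<p>') !== false);
}
```

This is why a single greedy match cannot separate the pieces: everything before the final tag lands in one group.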

I searched Stack Overflow and found "Match text outside of html tags", but the regex provided there produces a pattern error when I paste it into regex101.com.

Thanks for your help.

Answer 1

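You can use PHP's DOMDocument and DOMXPath to get the values you want. The trick is to wrap the HTML from the database in a <div> tag, which can then be loaded into a DOMDocument. You can then use DOMXPath to search for the children of the <div> tag that are purely text, using the text() path: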

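$html = 'This should be extracted <p>I do not want this</p> This should also be extracted <a>This may appear after other tags and I do not want this</a>';
$doc = new DOMDocument();
// Wrap the fragment in a <div> so all top-level text shares one parent;
// the flags stop libxml from adding a doctype or implied <html>/<body> elements.
$doc->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$texts = array();
// text() selects only the text nodes that are direct children of the wrapper <div>.
foreach ($xpath->query('/div/text()') as $text) {
    $texts[] = $text->nodeValue;
}
print_r($texts);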

Demo on 3v4l.org
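If this is needed in more than one place, the same approach could be wrapped in a small function. The sketch below is not part of the original answer: the name textOutsideTags is hypothetical, and trimming the values and skipping whitespace-only nodes are assumptions about what is wanted.

```php
<?php
// Hypothetical helper: returns the text that sits directly inside the
// wrapper <div>, i.e. outside any child tag of the fragment.
function textOutsideTags(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings
    // for imperfect markup, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query('/div/text()') as $node) {
        // Assumption: surrounding whitespace is not wanted.
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $texts[] = $value;
        }
    }
    return $texts;
}

$html = 'This should be extracted <p>I do not want this</p> This should also be extracted <a>I do not want this</a>';
print_r(textOutsideTags($html));
```

Note that only text directly under the wrapper is returned; text nested inside any tag, however deep, is ignored, which matches the behavior asked for in the question.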

Extracting text outside of HTML tags
