I want to exclude only the contents of the JavaScript (script) tag when retrieving the text of the body element with XPath
▼ index.html
<body>
I want to acquire only "text excluding HTML tag" included in this part.
<script language="JavaScript" type="text/javascript">
var foo = 42;
</script>
</body>
I wrote the following code using DomCrawler. However, the result includes the contents of the script tag, so I don't get the expected output.
<?php
$crawler->filterXPath('//body')->each(function (DomCrawler $node) use ($url) {
$result = trim($node->text());
});
Answer 0 (score: 2)
Try this:
<?php
$x = '<body>
I want to acquire only "text excluding HTML tag" included in this part.
<script language="JavaScript" type="text/javascript">
var foo = 42;
</script>
</body>';
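$dom = new DOMDocument();
$dom->loadHTML($x);
// Remove the first <script> element from the document
$script = $dom->getElementsByTagName('script')->item(0);
$script->parentNode->removeChild($script);
// The remaining text of <body> no longer contains the JavaScript
$body = $dom->getElementsByTagName('body')->item(0);
echo $body->nodeValue;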
Working example here: https://3v4l.org/n2UQT
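Note that the snippet above removes only the first script element. If the page may contain several, a possible variation is sketched below. The function name bodyTextWithoutScripts and the sample HTML are my own assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the answer): returns the text of <body>
// after removing every <script> element from the parsed document.
function bodyTextWithoutScripts(string $html): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is often not perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it to an array
    // before removing nodes; otherwise the list shifts while iterating.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

// Sample input made up for this sketch: two script blocks inside body.
$html = '<body>Visible text.<script>var a = 1;</script> More text.<script>var b = 2;</script></body>';
echo bodyTextWithoutScripts($html);
```

This changes the parsed DOM, so it is best used on a document you don't need to keep intact afterwards.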
Answer 1 (score: 1)
I suggest using DOMXPath to filter the content with a query. I'm not very familiar with DomCrawler.
<?php
// To retrieve selected HTML data, try these DOMXPath examples:
$file = $_SERVER['DOCUMENT_ROOT'] . '/test.html';
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXPath($doc);
// Example 1: every element that has an id attribute
//$elements = $xpath->query("//*[@id]");
// Example 2: script elements that are direct children of body
//$elements = $xpath->query("/html/body/script");
// Example 3: as above with wildcards (also matches scripts under head)
$elements = $xpath->query("/*/*/script");
// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
?>
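The queries above select the script elements themselves. To get what the question asks for (the body text without the script contents), one option might be to select only the text nodes that have no script ancestor. This is a minimal sketch; the function name visibleBodyText and the XPath expression are my own assumptions, not something taken from the answers.

```php
<?php
// Hypothetical helper (not from the answers): joins the text nodes under
// <body> that are not inside a <script> element, leaving the DOM unchanged.
function visibleBodyText(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // text() selects text nodes; the predicate drops any that sit inside <script>.
    $textNodes = $xpath->query('//body//text()[not(ancestor::script)]');
    if ($textNodes === false) {
        return '';
    }

    $parts = [];
    foreach ($textNodes as $textNode) {
        $parts[] = $textNode->nodeValue;
    }
    return trim(implode('', $parts));
}

// The markup from the question, used here as sample input.
$html = '<body>
I want to acquire only "text excluding HTML tag" included in this part.
<script language="JavaScript" type="text/javascript">
var foo = 42;
</script>
</body>';
echo visibleBodyText($html);
```

The same expression should also be usable to skip other elements, for example by extending the predicate with `and not(ancestor::style)`.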