I am trying to parse the text content from a URL. Here is my code:
<?php
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$content = file_get_contents($url);
echo $content; // Outputs the whole page: markup, images and everything else
$text=escapeshellarg(strip_tags($content));
echo "<br>";
echo $text; // Still contains source code (e.g. script contents), not only the visible text
?>
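I want to get only the text that is written on the page, without the page source code. Any ideas? I have googled this, but only the method above turns up everywhere.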
Answer 0 (score: 4)
You can use DOMDocument and DOMNode:
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
$script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
You can also do this without XPath:
$doc = new DOMDocument();
$doc->loadHTMLFile($url); // Load the HTML
// getElementsByTagName() returns a live list, so copy it before removing nodes
foreach(iterator_to_array($doc->getElementsByTagName('script')) as $script) {
$script->parentNode->removeChild($script); // remove the script and its content
                                           // so it will not appear in the text
}
$textContent = $doc->textContent; //inherited from DOMNode, get the text.
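As a rough sketch of the same idea, the helper below works on an HTML string instead of a URL, also drops style and noscript elements, and collapses whitespace. The name extractVisibleText and the choice of elements to remove are my own assumptions, not part of the answer above:

```php
<?php
// Hypothetical helper (not from the answer above): returns the readable text of an HTML string.
function extractVisibleText(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc->documentElement === null) {
        return '';
    }

    // Assumption: script, style and noscript are the elements whose content is not page text.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style | //noscript')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse runs of whitespace into single spaces.
    $text = preg_replace('/\s+/', ' ', $doc->documentElement->textContent);
    return trim($text ?? '');
}

$html = '<html><head><style>p{color:red}</style></head>'
      . '<body><p>Hello   world</p><script>var x = 1;</script><p> again</p></body></html>';
$visibleText = extractVisibleText($html);
echo $visibleText, PHP_EOL;
```

For a live page you could pass the result of file_get_contents($url) to it. As far as I know, loadHTML() assumes ISO-8859-1 unless the markup declares a charset, so UTF-8 pages without a meta charset tag may come out garbled.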
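Answer 1 (score: 2)
$content = strip_tags(file_get_contents($url));
This removes the HTML tags from the page. Note that strip_tags() has to be applied to the downloaded content, not to the URL.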
Answer 2 (score: 1)
To remove the HTML tags, use:
$text = strip_tags($text);
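One caveat: strip_tags() only removes the tags themselves, so the contents of script and style elements stay in the result, which is the "source code" the question complains about. A minimal sketch of a workaround, assuming a regex pre-pass is good enough for your pages (the helper name htmlToText is made up; a DOM parser is more robust on messy markup):

```php
<?php
// Hypothetical helper: strip script/style blocks first, then the remaining tags.
function htmlToText(string $html): string
{
    // Remove <script>...</script> and <style>...</style> including their content.
    $withoutCode = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    $text = strip_tags($withoutCode);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

$sample = '<p>Hi &amp; bye</p><script>alert(1);</script>';
$plain = strip_tags($sample);   // the script body is still in here
$clean = htmlToText($sample);   // the script body is gone
echo $plain, PHP_EOL, $clean, PHP_EOL;
```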
Answer 3 (score: 1)
A simple cURL request will solve this. [TESTED]
<?php
$ch = curl_init("http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects (missing from the first version of this answer)
echo strip_tags(curl_exec($ch));
curl_close($ch);
?>
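Note that curl_exec() returns false on failure, which the snippet above would silently print as an empty string. Here is a hedged sketch with basic error handling. The function name fetchPage, the timeout values and the user agent string are assumptions of mine, and strip_tags() will still leave script contents in the output:

```php
<?php
// Hypothetical wrapper: returns the response body, or null if the request failed.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_MAXREDIRS      => 5,     // assumed limit
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds, assumed
        CURLOPT_TIMEOUT        => 15,    // seconds, assumed
        CURLOPT_USERAGENT      => 'text-extractor-example/0.1', // made-up value
    ]);

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);

    // Treat transport errors and HTTP error statuses as failures.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

Usage would be along the lines of: call fetchPage($url), check the result against null, and only then pass it to strip_tags() or a DOM-based extractor.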