DOM PHP - Find all possible text in a web page (title, placeholder, ...)

Time: 2015-09-03 10:20:23

Tags: php html dom

I have a classic HTML web page:

<html>
<head>
  <meta charset="utf-8">
  <title>Some text</title>
  <link rel="stylesheet" href="style.css">
  <script src="script.js"></script>
  <script>
      var text = "Hi guys !";
  </script>
</head>
<body>
    <h1>Hello guys</h1>
    <p>Some text <strong>is more important</strong></p>
    <input value="Here also is some text" placeholder="and here too">
    <a href="not here">here is some text</a>
</body>
</html>

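I would like to get all the text from this page using PHP. Checking each node for the DOMText nodeType misses the placeholder, because attribute values are not text nodes.

Is there a simple way to quickly get all the real text (which in my case means all the English text)?

3 answers:

Answer 0: (score: 0)

Assuming you only want the children of the body element...

Sample HTML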

<html><head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title> Example</title>
</head>
<body>
  a <div>b<span>c</span></div>
</body></html>

JavaScript

var body = document.body;
var textContent = body.textContent || body.innerText;

console.log(textContent);  //   a bc

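You need the innerText fallback because our good friend IE (version 8 and earlier) supports only innerText, not textContent.

If you have a library like jQuery, it is even easier: $('body').text()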
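Since the question asks for PHP rather than JavaScript, here is a minimal sketch of the same idea with PHP's DOMDocument. It reuses the sample HTML from this answer; the variable names are only illustrative, and the whitespace collapsing is an added convenience, not something the JavaScript version does.

```php
<?php
// Sample HTML from this answer.
$html = '<html><head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title> Example</title>
</head>
<body>
  a <div>b<span>c</span></div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch the <body> element; item(0) is the first (and only) match.
$body = $dom->getElementsByTagName('body')->item(0);

// textContent concatenates every descendant text node, like the JS property.
// Collapse runs of whitespace so the result is easier to read.
$bodyText = trim(preg_replace('/\s+/', ' ', $body->textContent));

echo $bodyText, PHP_EOL; // a bc
```

Like the JavaScript version, this only returns text nodes, so attribute values such as placeholder are not included.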


Answer 1: (score: 0)

Reference: http://www.phpro.org/examples/Get-Text-Between-Tags.html

<?php
$html='<html>
<head>
<meta charset="utf-8">
<title>Some text</title>
<link rel="stylesheet" href="style.css">
<script src="script.js"></script>
<script>
  var text = "Hi guys !";
</script>
</head>
<body>
<h1>Hello guys</h1>
<p>Some text <strong>is more important</strong></p>
<input value="Here also is some text" placeholder="and here too">
<a href="not here">here is some text</a>
</body>
</html>';

$content = getTextBetweenTags('body', $html);

foreach ($content as $item) {
    echo $item . '<br />';
}

/**
 * Return the text content of every element with the given tag name.
 * With $strict = 1 the input is parsed as XML instead of HTML.
 */
function getTextBetweenTags($tag, $html, $strict = 0)
{
    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** discard white space (must be set before loading to have any effect) ***/
    $dom->preserveWhiteSpace = false;

    /*** load the html into the object ***/
    if ($strict == 1) {
        $dom->loadXML($html);
    } else {
        $dom->loadHTML($html);
    }

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagName($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item) {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }

    /*** return the results ***/
    return $out;
}

Answer 2: (score: 0)

Use the textContent property of DOMDocument

<?php
error_reporting(-1);

// $str is assumed to hold the HTML from the question
$dom = new DOMDocument();
$dom->loadHTML($str);
echo $dom->textContent;

Result

Some text
      var text = "Hi guys !";

    Hello guys
    Some text is more important
    here is some text
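
The result above includes the inline script source but not the placeholder or value attributes that the question asks about. A minimal sketch of one way to cover both, assuming DOMXPath is acceptable: collect text nodes outside script and style elements, then add the values of a few text-bearing attributes. The function name collectAllText and the default attribute list are hypothetical choices for illustration, not part of any answer here.

```php
<?php
// Hypothetical helper: returns visible text nodes followed by the values of
// selected attributes (the default list is an assumption; adjust as needed).
function collectAllText(string $html, array $attributes = ['placeholder', 'value', 'title', 'alt']): array
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];

    // Text nodes, skipping anything inside <script> or <style>.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $texts[] = $text;
        }
    }

    // Attribute values that carry human-readable text.
    foreach ($attributes as $name) {
        foreach ($xpath->query('//@' . $name) as $attr) {
            $text = trim($attr->nodeValue);
            if ($text !== '') {
                $texts[] = $text;
            }
        }
    }

    return $texts;
}

// The page from the question.
$html = <<<'HTML'
<html>
<head>
  <meta charset="utf-8">
  <title>Some text</title>
  <script>
      var text = "Hi guys !";
  </script>
</head>
<body>
    <h1>Hello guys</h1>
    <p>Some text <strong>is more important</strong></p>
    <input value="Here also is some text" placeholder="and here too">
    <a href="not here">here is some text</a>
</body>
</html>
HTML;

$texts = collectAllText($html);
print_r($texts);
```

The href value is left out because it is not in the attribute list. Deciding which attributes count as "real text" is a judgment call that depends on the page.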