Question

How can I get the source of an HTML page without the HTML tags? For example:

<meta http-equiv="content-type" content="text/html; charset=utf-8" /> 
<meta http-equiv="content-language" content="hu"/> 
<title>this is the page title</title>
<meta name="description" content="this is the description" />
<meta name="keywords" content="k1, k2, k3, k4" />
start the body content
<!-- <div>this is comment</div> -->
<a href="open.php" title="this is title attribute">open</a>
End now one noframes tag.
<noframes><span>text</span></noframes>
<select name="select" id="select"><option>ttttt</option></select>
<div class="robots-nocontent"><span>something</span></div>
<img src="url.png" alt="this is alt attribute" />

I need this result:

this is the page title this is the description k1, k2, k3, k4 start the body content this is title attribute open End now one noframes tag. text ttttt something this is alt attribute

I also need the title and alt attributes. Any ideas?

Answer 1

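This cannot be done automatically. PHP has no way of knowing which nodes and attributes you want to ignore. You either have to write code that iterates over all attributes and text nodes, to which you can supply a map defining when to use a node's content, or you simply pick out what you want one item at a time.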
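The map-driven walk described above might look roughly like the following. This is only a sketch: the `$attributeMap` layout, the `collectText()` helper, and the rule that only `<meta name="...">` elements count are assumptions made for illustration, not something the question specifies.

```php
<?php
// Hypothetical map: element name => attributes whose values should be kept.
$attributeMap = [
    'meta' => ['content'],
    'a'    => ['title'],
    'img'  => ['alt'],
];

// Collect text nodes and mapped attribute values in document order.
function collectText(DOMNode $node, array $map, array &$out): void
{
    if ($node instanceof DOMText) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $out[] = $text;
        }
        return;
    }
    if ($node instanceof DOMElement) {
        // Assumption: only <meta name="..."> is wanted, not http-equiv metas.
        $skip = $node->nodeName === 'meta' && !$node->hasAttribute('name');
        foreach ($skip ? [] : ($map[$node->nodeName] ?? []) as $attr) {
            if ($node->hasAttribute($attr)) {
                $out[] = $node->getAttribute($attr);
            }
        }
    }
    // Comments are neither DOMText nor DOMElement and have no children,
    // so they contribute nothing.
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        collectText($child, $map, $out);
    }
}

// Shortened sample input, modeled on the markup in the question.
$html = '<meta http-equiv="content-language" content="hu" />'
      . '<title>page title</title>'
      . '<meta name="description" content="the description" />'
      . '<p>body text <a href="open.php" title="link title">open</a></p>'
      . '<!-- <div>comment</div> -->'
      . '<img src="url.png" alt="alt text" />';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$parts = [];
collectText($doc, $attributeMap, $parts);
$result = implode(' ', $parts);
echo $result, PHP_EOL;
```

Attribute values are emitted before the element's own text, which matches the order in the desired result ("this is title attribute open"). Extending the map is how you would opt further attributes in or out.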

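An alternative would be to use XMLReader. It lets you iterate over the entire document and define callbacks for element names. That way you can define what to do with each element. See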

http://www.ibm.com/developerworks/library/x-pullparsingphp.html
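As a rough illustration of the XMLReader idea, the sketch below dispatches on element names through a small table of callbacks. The `$handlers` table and the sample markup are made up for this example, and XMLReader needs well-formed XML, so it assumes the input is XHTML.

```php
<?php
// Hypothetical callback table: element name => function returning text to keep.
$handlers = [
    'a'   => function (XMLReader $r) { return $r->getAttribute('title'); },
    'img' => function (XMLReader $r) { return $r->getAttribute('alt'); },
];

// Well-formed sample input (assumption: the page is valid XHTML).
$xhtml = '<div><a href="open.php" title="link title">open</a> and '
       . '<img src="url.png" alt="alt text" /></div>';

$reader = new XMLReader();
$reader->XML($xhtml);

$parts = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && isset($handlers[$reader->name])) {
        // Run the callback registered for this element name.
        $value = $handlers[$reader->name]($reader);
        if ($value !== null && $value !== '') {
            $parts[] = $value;
        }
    } elseif ($reader->nodeType === XMLReader::TEXT) {
        $text = trim($reader->value);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    // Comments and whitespace-only nodes fall through and are ignored.
}
$reader->close();

$result = implode(' ', $parts);
echo $result, PHP_EOL;
```

Because XMLReader streams the document, this approach should also cope with large pages without building a full DOM tree.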

Answer 2

You could do it with a regular expression.

$regex = '/<[^>]*>/';

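is a very simple start that removes anything surrounded by < and >. To do this, though, you will have to bring the HTML in as a string, using file_get_contents() or some other function that turns the page into text.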
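A minimal sketch of that idea, using a hard-coded string in place of file_get_contents() so it runs on its own. The pattern used here is one plausible way to match "anything between < and >" and is an assumption rather than a tested recipe for real-world HTML.

```php
<?php
// In practice $html might come from file_get_contents() on a file or URL;
// a literal string is used here so the example is self-contained.
$html = '<p>Hello <b>world</b></p><!-- note -->';

// Replace every "<...>" run with a space so adjacent words do not merge.
$text = preg_replace('/<[^>]*>/', ' ', $html);

// Collapse the leftover whitespace.
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, PHP_EOL;
```

Note that a comment containing markup, such as `<!-- <div>x</div> -->`, would leave fragments behind with this pattern. PHP's built-in strip_tags() covers the basic case too, though neither approach keeps attribute values.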

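Addendum:

If you want to pull out individual attributes, you will have to write a more complex regular expression to extract that text. For example: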

$regex2 = '/\<.(?<=(title))(\=\").(?=\")/';

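Assuming you have no other matching expressions before the title, that would pull out (I think... I'm still learning RegEx) any text between a < and title=". Again, this would be a fairly complicated regex process.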
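For comparison, here is one way attribute values could be captured with preg_match_all(). The pattern is an illustrative assumption: it only handles double-quoted title and alt attributes and would need more work for single quotes or unquoted values.

```php
<?php
$html = '<a href="open.php" title="link title">open</a>'
      . '<img src="url.png" alt="alt text" />';

// Capture the value of every double-quoted title="..." or alt="..." attribute.
// The leading \s keeps names that merely end in "title" or "alt" from matching.
preg_match_all('/\s(?:title|alt)="([^"]*)"/i', $html, $matches);

$values = $matches[1]; // captured attribute values, in document order
echo implode(' ', $values), PHP_EOL;
```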

Answer 3

My solution is a bit more complicated, but it works well for me.

If you are sure you have XHTML, you can simply treat the code as XML (but you have to put everything inside a proper wrapper element).

Then, using XSLT, you can define a few basic templates that do what you need.
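A sketch of what such templates might look like when run through PHP's XSLTProcessor. It assumes the xsl extension is installed and that the input carries no XML namespace; the stylesheet and sample markup are invented for illustration.

```php
<?php
// Sample input wrapped in a single root element so it is well-formed XML.
$xhtml = '<div>start <a href="open.php" title="link title">open</a> end '
       . '<img src="url.png" alt="alt text"/><!-- ignored --></div>';

// Two small templates; XSLT's built-in rules copy text nodes and skip comments.
$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="a"><xsl:value-of select="@title"/><xsl:text> </xsl:text><xsl:apply-templates/></xsl:template>
  <xsl:template match="img"><xsl:value-of select="@alt"/></xsl:template>
</xsl:stylesheet>
XSL;

$xml = new DOMDocument();
$xml->loadXML($xhtml);

$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

// With method="text" the transformation result is a plain string.
$result = trim($processor->transformToXml($xml));
echo $result, PHP_EOL;
```

Real XHTML usually declares the http://www.w3.org/1999/xhtml namespace, in which case the match patterns would need a prefix bound to that namespace.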

html to text with domdocument class
