Question

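I want to get some data from this website, but as you can see in their HTML there is some odd markup, such as <TABLE BORDER=0 CELLSPACING=1 CELLPADDING=3 WIDTH=100%> with no quotes around the attribute values, among other things. Because of that I get errors when I try to parse the table with SimpleXmlElement, which I have used for a while and which works fine on some websites. I'm doing something like this: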

$html = file_get_html('https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera');
$table = $html->find('table', 4);

$xml = new SimpleXmlElement($table);

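I get a bunch of errors, so is there a way to clean up the code before passing it to SimpleXmlElement, or should I use some other kind of DOM class? What do you recommend?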

Answer 1

The problem with the HTML is that the tag attributes are not enclosed in quotes: unquoted attributes are allowed in HTML, but not in XML.
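To see the difference in isolation, here is a minimal sketch that feeds the same table to SimpleXMLElement twice, once with unquoted and once with quoted attributes. The helper name parsesAsXml and the sample markup are illustrative assumptions, not part of the original code.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when
// SimpleXMLElement accepts the markup, false when parsing fails.
function parsesAsXml(string $markup): bool
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        new SimpleXMLElement($markup);
        return true;
    } catch (Exception $e) {
        // The constructor throws when the string is not well-formed XML.
        return false;
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

// Sample markup (an assumption), modelled on the tag quoted in the question.
$unquoted = '<TABLE BORDER=0 CELLSPACING=1><TR><TD>x</TD></TR></TABLE>';
$quoted   = '<TABLE BORDER="0" CELLSPACING="1"><TR><TD>x</TD></TR></TABLE>';

var_dump(parsesAsXml($unquoted)); // bool(false)
var_dump(parsesAsXml($quoted));   // bool(true)
```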

If you don't care about the attributes, you can keep using Simple HTML Dom; otherwise you have to change the HTML parser.

Cleaning attributes with Simple HTML Dom:

Start by creating a function that clears all of a node's attributes:

function clearAttributes( $node )
{
    foreach( $node->getAllAttributes() as $key => $val )
    {
        $node->$key = Null;
    }
}

Then apply the function to the <table>, <tr> and <td> nodes:

clearAttributes( $table );

foreach( $table->find('tr') as $tr )
{
    clearAttributes( $tr );

    foreach( $tr->find( 'td' ) as $td )
    {
        clearAttributes( $td );
    }

}
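For comparison, the same idea can be sketched without Simple HTML Dom, using only the built-in DOM extension. This is a swapped-in technique rather than the code above: the function name clearDomAttributes and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical counterpart of clearAttributes() for built-in DOMElement nodes.
function clearDomAttributes(DOMElement $node): void
{
    // Remove the first attribute until none are left; the attribute
    // map is live, so its length shrinks after every removal.
    while ($node->attributes->length > 0) {
        $node->removeAttributeNode($node->attributes->item(0));
    }
}

libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
$dom = new DOMDocument();
// Sample markup (an assumption) with unquoted attributes.
$dom->loadHTML('<table border=0 width=100%><tr class=odd><td align=left>Rat</td></tr></table>');

// Strip the attributes from every element in the parsed document.
foreach ($dom->getElementsByTagName('*') as $element) {
    clearDomAttributes($element);
}

$table = $dom->getElementsByTagName('table')->item(0);
$clean = $dom->saveHTML($table); // serialized table, now without attributes
```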

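Last but not least: the site's HTML contains a lot of encoded characters. If you don't want them to show up as garbled characters in cells like <td>1 </td><td>0 </td> in your XML, you have to prepend a utf-8 declaration to the string before importing it into a SimpleXml object: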

$xml = '<?xml version="1.0" encoding="utf-8" ?>'.html_entity_decode( $table );
$xml = new SimpleXmlElement( $xml );
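The decoding step matters because XML predefines only five entities, so a leftover &nbsp; is rejected by the XML parser. A minimal sketch on an inline sample string (the sample markup and the variable names are assumptions):

```php
<?php
// Sample cell markup standing in for the scraped table (an assumption).
$table = '<table><tr><td>1&nbsp;</td><td>0&nbsp;</td></tr></table>';

// &nbsp; is an HTML entity that XML does not define, so decode it first.
$decoded = html_entity_decode($table, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Prepend the utf-8 declaration, as in the answer, and parse.
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="utf-8" ?>' . $decoded);

// Each cell now ends with a real U+00A0 no-break space; strip it if unwanted.
$first = str_replace("\u{00A0}", '', (string) $xml->tr->td[0]);
```

Note that html_entity_decode also decodes &amp; and &lt;, so text containing those entities may need a more selective replacement before it is parsed as XML.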

phpFiddle demo

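Preserving attributes with DOMDocument:

The built-in DOMDocument class is more powerful than Simple HTML Dom and uses less memory. In this case, it will format the original HTML for you. Despite appearances, it is simple to use.

First, you have to initialize a DOMDocument object, set libxml_use_internal_errors (to suppress the many warnings raised by malformed HTML) and load your URL: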

$dom = new DOMDocument();
libxml_use_internal_errors( 1 );
$dom->loadHTMLfile( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
$dom->formatOutput = True;

Then, you retrieve the desired <table>:

$table = $dom->getElementsByTagName( 'table' )->item(4);

And, as in the Simple HTML Dom example, you have to prepend the utf-8 declaration to avoid weird characters:

$xml = '<?xml version="1.0" encoding="utf-8" ?>'.$dom->saveHTML( $table );
$xml = new SimpleXmlElement( $xml );
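The same steps can be tried offline on an inline string instead of the remote page. In this sketch the sample markup, the item(0) index and the variable names are assumptions. It suggests that DOMDocument re-serializes the unquoted attributes with quotes, which is what lets SimpleXmlElement accept the result.

```php
<?php
// Sample markup with unquoted attributes, standing in for the remote page (an assumption).
$html = '<html><body><TABLE BORDER=0 CELLSPACING=1 WIDTH=100%>'
      . '<TR><TD>Rat</TD><TD>5</TD></TR></TABLE></body></html>';

libxml_use_internal_errors(true); // suppress warnings about sloppy HTML
$dom = new DOMDocument();
$dom->loadHTML($html);

// Only one table in this sample, hence item(0) rather than item(4).
$table = $dom->getElementsByTagName('table')->item(0);

// saveHTML() writes the attributes back out with quotes around their values.
$fragment = $dom->saveHTML($table);

$xml = new SimpleXMLElement('<?xml version="1.0" encoding="utf-8" ?>' . $fragment);
$cells = $xml->xpath('//td'); // every <td> element in the table
```

If the table contains void elements such as <br>, saveHTML() leaves them unclosed, which XML rejects; $dom->saveXML( $table ) may be a safer serializer in that case.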

phpFiddle demo

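As you can see, the DOMDocument syntax for retrieving a node as HTML differs from Simple HTML Dom: you always need to call the method on the main object and pass the node you want to print as an argument: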

echo $dom->saveHTML();          // print entire HTML document
echo $dom->saveHTML( $node );   // print node $node

Edit: removing &nbsp; with DOMDocument:

To remove unwanted &nbsp; entities from the HTML, you can load the HTML yourself first and use str_replace.

Change this line:

$dom->loadHTMLfile( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );

to this:

$data = file_get_contents( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
$data = str_replace( '&nbsp;', '', $data );
$dom->loadHTML( $data );
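A quick offline check of this replacement, with an inline sample string in place of the downloaded page (the sample markup and the $cells variable are assumptions):

```php
<?php
// Sample markup standing in for the downloaded page (an assumption).
$data = '<table><tr><td>1&nbsp;</td><td>0&nbsp;</td></tr></table>';

// Drop the literal entity text before the HTML is parsed.
$data = str_replace('&nbsp;', '', $data);

libxml_use_internal_errors(true); // suppress warnings about sloppy HTML
$dom = new DOMDocument();
$dom->loadHTML($data);

// Collect the text of every cell to confirm the entity is gone.
$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $cells[] = $td->textContent;
}
```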
