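Everyone knows we should always use DOM techniques instead of regular expressions to extract content from HTML, but I get the feeling that I can never trust the SimpleXML extension or similar ones.
I'm writing an OpenID implementation and tried using SimpleXML for the HTML discovery step, but my very first test (with alixaxel.myopenid.com) produced a lot of errors: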
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag mismatch: link line 11 and head in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: </head> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 64: parser error : Entity 'copy' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: &copy; 2008 <a href="http://janrain.com/">JanRain, Inc.</a> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 77: parser error : Opening and ending tag mismatch: link line 8 and html in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: </html> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag head line 3 in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag html line 2 in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
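I remember there was a way to make SimpleXML always parse a file, regardless of whether the document contains errors. I can't recall the exact implementation, but I think it involved DOMDocument. What is the best way to make sure SimpleXML always parses any given document?
Please don't suggest Tidy: I find the extension slow, and it isn't available on many systems.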
Answer 0 (score: 11)
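You can load the HTML with DOM's loadHTML and then import the result into SimpleXML.
IIRC, it will still choke on some input, but it will accept pretty much anything found on broken real-world websites.
$html = '<html><head><body><div>stuff & stuff</body></html>';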
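// collect libxml errors internally instead of emitting PHP warnings
$old = libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
// discard the collected errors and restore the old behaviour
libxml_clear_errors();
libxml_use_internal_errors($old);
// hand the repaired DOM tree over to SimpleXML
$sxe = simplexml_import_dom($dom);
die($sxe->asXML());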
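To tie this back to the OpenID discovery in the question, here is a minimal sketch of how the imported tree might be queried. The helper name html_to_simplexml, the sample markup, and the example.com URLs are all made up for illustration; only the loadHTML / simplexml_import_dom approach comes from the answer above.

```php
<?php
// Hypothetical helper (not part of any library): parse possibly malformed
// HTML via DOMDocument and return it as a SimpleXMLElement, or null on failure.
function html_to_simplexml(string $html): ?SimpleXMLElement
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $old = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $loaded = $dom->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($old);

    if (!$loaded || $dom->documentElement === null) {
        return null;
    }

    $sxe = simplexml_import_dom($dom);
    return $sxe instanceof SimpleXMLElement ? $sxe : null;
}

// Made-up sample page: unclosed <link> tags and an HTML-only entity,
// the same kinds of problems reported in the warnings in the question.
$html = '<html><head>'
      . '<link rel="openid.server" href="http://example.com/server">'
      . '<link rel="openid.delegate" href="http://example.com/user">'
      . '</head><body>&copy; 2008</body></html>';

$endpoints = [];
$sxe = html_to_simplexml($html);
if ($sxe !== null) {
    // Pick up every <link> whose rel attribute starts with "openid".
    foreach ($sxe->xpath('//link[starts-with(@rel, "openid")]') as $link) {
        $endpoints[(string) $link['rel']] = (string) $link['href'];
    }
}

print_r($endpoints);
```

Because the HTML parser closes the dangling link elements and knows the HTML entities, the XPath query sees a well-formed tree. Real pages may put several values in one rel attribute, so a production check would probably need to be looser than starts-with.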
Answer 1 (score: 0)
You could always try a SAX parser... it's a bit more robust to errors.
It may not be as efficient on large XML, though.