Question

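I'm using cheerio with node.js to parse web pages and then find data in them with CSS selectors. Cheerio doesn't cope well with malformed HTML. jsdom is more forgiving, but the two behave differently, and I've seen each of them break in cases where the other works fine.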

Chrome seems to do a fine job of building a DOM from the same malformed HTML.

How can I replicate Chrome's ability to create a DOM from malformed HTML, and then hand a "cleaned" HTML representation of that DOM to cheerio for processing?

That way I'd know the HTML it receives is well-formed. I tried phantomjs by setting page.content, but when I read page.content back, the HTML is still malformed.

Answer 1

You can use https://github.com/aredridel/html5/, which is a lot more forgiving and, in my experience, works where jsdom fails.

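But the last time I tested it, a few months ago, it was super slow. I hope it has gotten better. There is also the option of spawning a phantomjs process and having it print, on stdout, a JSON of the data you want, which you then feed back to your Node process.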
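
The subprocess idea might look roughly like this on the Node side. This is only a sketch: the helper name runAndParseJson is made up, and it assumes the child process prints exactly one JSON document on stdout and nothing else. For PhantomJS the command would be something like 'phantomjs' with a script of your own (hypothetical here) that loads the page and prints JSON.stringify of the result.

```javascript
'use strict';
const { execFileSync } = require('child_process');

// Hypothetical helper: runs an external command synchronously and parses
// whatever it printed on stdout as JSON. Throws if the command exits with
// a non-zero status or if stdout is not valid JSON.
function runAndParseJson(command, args) {
  const stdout = execFileSync(command, args, {
    encoding: 'utf8',            // return a string instead of a Buffer
    maxBuffer: 64 * 1024 * 1024, // serialized pages can be large
  });
  return JSON.parse(stdout);
}

// Assumed usage with PhantomJS (clean.js is a script you would write):
//   const data = runAndParseJson('phantomjs', ['clean.js', 'http://example.com/']);
//   console.log(data.html);
```

The synchronous call keeps the sketch short; in a server you would more likely use the asynchronous execFile or spawn so the event loop is not blocked while the page renders.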

Answer 2

This seems to do the trick, using phantomjs-node and jquery:

function cleanHtmlWithPhantom(html, callback){
    var phantom = require('phantom');
    phantom.create(
        function(ph){
            ph.createPage(
                function(page){
                    page.injectJs(
                        "/some_local_location/jquery_1.6.1.min.js",
                        function(){
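                            page.evaluate(
                                // The function is sent to PhantomJS as source text;
                                // newHtml is a placeholder substituted below.
                                function(){
                                    $('html').html(newHtml);
                                    return $('html').html();
                                }.toString().replace(/newHtml/g, function(){
                                    // JSON.stringify escapes quotes, backslashes and
                                    // newlines, which plain '...' wrapping does not.
                                    return JSON.stringify(html);
                                }),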
                                function(result){
                                    callback("<html>" + result + "</html>")
                                    ph.exit();
                                }
                            )
                        }
                    );
                }
            )
        }
    )
}

cleanHtmlWithPhantom(
    "<p>malformed",
    function(newHtml){
        console.log(newHtml);
    }
)

