Question

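We need a DOM parser that can run a set of patterns against a page and store the results. We are looking for an open-source library to start from. It should:

be able to select elements by regexp (for example, get all elements that contain "price" in their class, id, meta attributes, or any other attribute),
come with plenty of helpers: removing comments, iframes, etc.,
be very fast,
be able to run from a browser extension.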

Answer 1

OK, I'll say it:
you can use jQuery.

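Pros:

it is a very good DOM parser
it is very good at manipulating the DOM (removing/adding/editing elements)
it has a great and intuitive API
it has a big and great community => lots of answers to any jQuery-related question
it works in browser extensions (tested myself in Chrome, and it apparently works in FF extensions too: How to use jQuery in Firefox Extension)
it is lightweight (about 31KB minified and gzipped)
it is cross-browser
it is definitely open source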

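Cons:

it does not rely on regex (although that is a very good thing, as dda already mentioned), but regex can still be used to filter elements
I don't know whether it can access/manipulate comments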
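On the comments point: comment nodes are ordinary DOM nodes with nodeType 8, and jQuery's .contents() returns them along with text nodes, so they can be reached. Below is a minimal sketch in plain DOM calls, with no jQuery required; removeComments is a hypothetical helper name, not part of any library.

```javascript
var COMMENT_NODE = 8; // numeric value of Node.COMMENT_NODE

// Hypothetical helper: recursively removes every comment node below `node`
// and returns how many were removed.
function removeComments(node) {
    // Copy the child list first: childNodes is live in a real DOM,
    // so removing while iterating over it would skip nodes.
    var children = Array.prototype.slice.call(node.childNodes || []);
    var removed = 0;
    children.forEach(function (child) {
        if (child.nodeType === COMMENT_NODE) {
            node.removeChild(child);
            removed += 1;
        } else {
            removed += removeComments(child);
        }
    });
    return removed;
}

// In a page or a content script this would be called as:
// removeComments(document.documentElement);
```

The recursion visits the whole tree once, so it should be cheap enough to run before any pattern matching.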

Here are some examples of jQuery operations:

// select all the iframe elements with the class advertisement 
// that have the word "porn" in their src attribute
$('iframe.advertisement[src*=porn]')
    // filter the ones that contain the word "poney" in their title
    // with the help of a regex (no g flag: test() on a global regex keeps state)
    .filter(function(){
        return /poney/i.test(this.title || (this.contentDocument || {}).title || '');
    }) 
        // and remove them
        .remove()
        // return to the whole match
        .end()
    // filter them again, this time 
    // affect only the big ones
    .filter(function(){
        return $(this).width() > 100 && $(this).height() > 100;
    })
        // replace them with some html markup
        .replaceWith('<img src="harmless_bunnies_and_kitties.jpg" />');
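For the "price" case from the question, where the match can be in any attribute, an attribute selector is not enough, but a filter function is. This is only a sketch: hasAttributeMatching and selectByAttributeRegex are hypothetical names, and it assumes the elements expose the standard DOM attributes collection. It tests attribute values only, not attribute names.

```javascript
// True when the value of any attribute on `el` matches `pattern`.
// Pass a regex without the g flag, since test() on a global regex is stateful.
function hasAttributeMatching(el, pattern) {
    return Array.prototype.some.call(el.attributes || [], function (attr) {
        return pattern.test(attr.value);
    });
}

// Hypothetical helper: every descendant of `root` with a matching attribute value.
function selectByAttributeRegex(root, pattern) {
    return Array.prototype.filter.call(root.querySelectorAll('*'), function (el) {
        return hasAttributeMatching(el, pattern);
    });
}

// In a page or a content script:
// var priced = selectByAttributeRegex(document, /price/i);
// The same predicate works inside a jQuery chain:
// $('*').filter(function () { return hasAttributeMatching(this, /price/i); });
```

Walking every element with '*' is the slow part, so narrowing the root or the selector first is probably worthwhile on large pages.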

Answer 2

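node-htmlparser can parse HTML, gives you a DOM with a lot of utilities (it also supports filtering by function), and can run in any context (even in WebWorkers).

I forked it a while ago and improved it for speed, with some crazy results (read: even faster than the native libexpat bindings).

Still, I would recommend the original version, since it supports browsers out of the box (my fork can run in the browser through browserify, which adds some overhead).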

Do you know of an open-source JavaScript extraction / regexp engine?
