Question

I'm trying to extract the content of an HTML file using plain Ruby (not RoR).

This is what I'm doing:

require 'sanitize'
require 'nokogiri'

doc = Nokogiri::HTML(html_document)
a = Sanitize.fragment(doc.css('body'))

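This extracts the content inside the <body> tag and strips all HTML tags. Unfortunately, the JS code inside the <script> tags is still left in the output.

How can I remove the JS scripts as well as the HTML tags?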

Answer 1

I'm assuming you are using a recent version of Sanitize.

html = "<html><head><title></title><style>.red{color:red;}</style></head><body><div>... <b>some content</b> ...</div><script>... a script ...</script></body></html>"

Sanitize.fragment(html, :remove_contents => ['script'])
# => ".red{color:red;} ... some content ... "

Sanitize.fragment(html, :remove_contents => ['script', 'style'])
# => " ... some content ... "

See: :remove_contents
