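I am trying to crawl a sample HTML file in a Nutch + HBase setup. When I retrieve the NutchDocument (org.apache.nutch.indexer.NutchDocument) to read the content, I get the data as plain text in the following format: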
tstamp: [1970-01-01T00:00:00.000Z]
digest: [52e6d9e5e5e96e2cfac7fcd92cd117f8]
host: []
boost: [1.0]
id: [:file/home/file.html]
title: [Nutch1]
url: [file:///home/file.html]
content: [Nutch1 Nutch1 The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.11, we advise all current users and developers of the 1.X series to upgrade to this release. Nutch-Nutch-Identifies the overall Positive]
But I was expecting the raw HTML content, not the extracted text.
Am I missing a setting?
Thanks
Answer 0: (score: 0)
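Take a look at the index-html plugin on the 2.x branch.
This plugin lets you index the raw HTML content of a document. By default, Nutch only parses/extracts and indexes the text content, and ignores all HTML markup.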
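Nutch plugins are normally switched on through the plugin.includes property in conf/nutch-site.xml, so enabling index-html would presumably look something like the sketch below. This is an illustration, not a verified configuration: it assumes the index-html plugin is built and present in your plugins directory, and every entry in the value other than the added html alternative is a placeholder for whatever your setup already lists (protocol-file is shown only because the question crawls a file:// URL).

```xml
<!-- conf/nutch-site.xml (sketch; merge into your existing plugin.includes value) -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: only the "html" alternative inside index-(...) is the change;
       the other plugin names stand in for your current list. -->
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|html)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

After changing the plugin list, the page would most likely need to be parsed and indexed again before the raw HTML appears in the document. The name of the field that holds it is best confirmed in the plugin's source.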