Question

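I want to train an NLP model on a large number of web pages to get good accuracy. Since I don't have the pages, I'm considering running a web crawler on Amazon EMR. I'd like a distributed, scalable, and extensible open-source solution that respects robots.txt rules. After some research, I decided to go with Apache Nutch.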

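I found this video by Julien Nioche, one of Nutch's main contributors, particularly useful for getting started. Even though I used the latest available versions of Hadoop (Amazon 2.7.3) and Nutch (2.3.1), I managed to get a small example job to complete successfully.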

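Unfortunately, I couldn't find a simple way to retrieve the raw HTML files from Nutch's output. While looking for a solution, I found a few other useful resources (in addition to Nutch's own wiki and tutorial pages).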

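Some of them (such as this answer or this page) suggest implementing a new plugin (or modifying an existing one): the general idea is to add a few lines of code that save the content of each fetched HTML page to a file before it is sent to a segment.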

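Others (such as this answer) suggest implementing a simple post-processing tool that accesses the segments, goes through all the records contained in them, and saves the content of any that appear to be HTML pages to a file.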

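All of these resources contain (more or less precise) instructions and code examples, but I had no luck when I tried to run them because they refer to very old versions of Nutch. In addition, all my attempts to adapt them to Nutch 2.3.1 failed for lack of resources and documentation.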

For example, I appended the following code to the end of HtmlParser (the core of the parse-html plugin), but all the files saved in the specified folder are empty:

String html = root.toString();
if (html == null) {
    byte[] bytes = content.getContent();
    try {
      html = new String(bytes, encoding);
    } catch (UnsupportedEncodingException e) {
        LOG.trace(e.getMessage(), e);
    }
}
if (html != null) {
    html = html.trim();
    if (!html.isEmpty()) {
        if (dumpFolder == null) {
            String currentUsersHomeFolder = System.getProperty("user.home");
            currentUsersHomeFolder = "/Users/stefano";
            dumpFolder = currentUsersHomeFolder + File.separator + "nutch_dump";
            new File(dumpFolder).mkdir();
        }
        try {
            String filename = base.toString().replaceAll("\\P{LD}", "_");
            if (!filename.toLowerCase().endsWith(".htm") && !filename.toLowerCase().endsWith(".html")) {
                filename += ".html";
            }
            System.out.println(">> " + dumpFolder+ File.separator +filename);
            PrintWriter writer = new PrintWriter(dumpFolder + File.separator + filename, encoding);
            writer.write(html);
            writer.close();
        } catch (Exception e) {
            LOG.trace(e.getMessage(), e);
        }
    }
}

In another case I got the following error instead (I like it because it mentions the prolog, but it also puzzles me):

[Fatal Error] data:1:1: Content is not allowed in prolog.

So, before considering downgrading my setup to Nutch 1.x, my question is: has any of you faced this problem with a recent version of Nutch and solved it successfully?

If so, could you share your solution with the community, or at least provide some useful pointers towards one?

Many thanks in advance!

PS: If you're wondering how to properly open the Nutch sources in IntelliJ, this answer might actually point you in the right direction.

Answer 1

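Glad you found the video useful. If you just need web pages to train an NLP model, why not use the CommonCrawl dataset? It contains billions of pages, is free, and would save you the hassle of large-scale web crawling.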

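Now, to answer your question: you could write a custom IndexWriter and write the content of the pages to wherever you want. I don't use Nutch 2.x; I prefer 1.x because it is faster, has more functionality, and is easier to use (to be honest, I actually prefer StormCrawler, but I'm biased). Nutch 1.x has a WARCExporter class that can generate a data dump in the same WARC format used by CommonCrawl; there is also another class for exporting in various formats.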
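The file-writing half of such a writer could look roughly like the sketch below. It is only an illustration under stated assumptions: it deliberately leaves out Nutch's IndexWriter interface and plugin wiring, and RawHtmlDumper, fileNameFor and dump are made-up names, not Nutch classes or methods. It assumes the raw page bytes are available as a ByteBuffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Nutch): writes raw page bytes to one file per URL. */
final class RawHtmlDumper {
    private final Path dumpDir;

    RawHtmlDumper(Path dumpDir) throws IOException {
        // Create the dump folder (and any missing parents) if it does not exist yet.
        this.dumpDir = Files.createDirectories(dumpDir);
    }

    /** Builds a file name from a URL: every character that is not a letter or digit becomes '_'. */
    static String fileNameFor(String url) {
        // The dot of an existing ".html" suffix is replaced too, so the extension is always appended.
        return url.replaceAll("[^\\p{L}\\p{Nd}]", "_") + ".html";
    }

    /** Writes the readable bytes of the buffer to disk without moving the caller's position. */
    Path dump(String url, ByteBuffer content) throws IOException {
        ByteBuffer view = content.duplicate();   // independent position and limit
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        // Writing bytes (not a decoded String) keeps the page's original encoding intact.
        return Files.write(dumpDir.resolve(fileNameFor(url)), bytes);
    }
}
```

A custom writer would create one RawHtmlDumper when it is opened and call dump(url, content) for every document it receives. Different URLs can map to the same file name with this naming scheme, so a real implementation might prefer a hash of the URL.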

Answer 2

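You can save the raw HTML by editing the Nutch code. First, get Nutch running in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

Once Nutch is running in Eclipse, edit the file FetcherReducer.java, add the code below to the output method, and run ant eclipse again to rebuild the classes.

After that, the raw HTML will be stored in the reprUrl column of your database: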

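  // Requires java.io.ByteArrayInputStream, java.nio.ByteBuffer and java.util.Scanner.
  if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        // Wrap only the readable part of the buffer's backing array.
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        // Note: without an explicit charset, Scanner decodes with the platform default.
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read the whole stream as a single token
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        // Store the raw HTML in the page's reprUrl field.
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    }
  }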
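If the platform default charset is a concern, the Scanner step can be swapped for a direct decode of the buffer. The following is a minimal sketch of that alternative, not part of the original patch: RawContentDecoder and decode are hypothetical names, and the choice of charset (UTF-8 in the example) is an assumption, since the real page encoding may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Nutch): decodes the readable part of a ByteBuffer. */
final class RawContentDecoder {

    /** Returns the buffer's remaining bytes as text, or "" when the buffer is null. */
    static String decode(ByteBuffer raw, Charset charset) {
        if (raw == null) {
            return "";
        }
        // duplicate() shares the bytes but has its own position, so the caller's buffer is untouched.
        // Malformed byte sequences are replaced rather than raising an exception.
        return charset.decode(raw.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer raw = ByteBuffer.wrap("<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

Inside the patch above, the decoded string would take the place of data before it is passed to setReprUrl.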

Getting the raw HTML of pages fetched by Nutch 2.3.1
