Question

I am looking for a framework for crawling articles, and I found Nutch 2.1. Here is my plan, with the questions I have at each step:

1

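Add the article list pages to url/seed.txt. Here is one problem: what I actually want indexed is the article pages, not the article list pages. But if I don't allow the list pages to be indexed, Nutch will do nothing, because the list pages are the entry point. So how can I index only the article pages and not the list pages?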

2

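Write a plugin to parse out the 'author', 'date', 'article body', 'headline' and maybe other information from the HTML. The 'Parser' plugin interface in Nutch 2.1 is: Parse getParse(String url, WebPage page). The 'WebPage' class has some predefined attributes: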

public class WebPage extends PersistentBase {
  // ...
  private Utf8 baseUrl;
  // ...
  private ByteBuffer content; // <== This becomes null in IndexFilter
  // ...
  private Utf8 title;
  private Utf8 text;
  // ...
  private Map<Utf8,Utf8> headers;
  private Map<Utf8,Utf8> outlinks;
  private Map<Utf8,Utf8> inlinks;
  private Map<Utf8,Utf8> markers;
  private Map<Utf8,ByteBuffer> metadata;
  // ...
}

So, as you can see, there are 5 maps I could put my custom attributes in. But 'headers', 'outlinks' and 'inlinks' do not seem to be meant for this. Maybe I could put that information into 'markers' or 'metadata'. Are they designed for this purpose?
BTW, the Parser in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.

3

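After the articles are indexed into Solr, another application can query them by 'date' and then store the article information in MySQL. My question here is: can Nutch store the articles directly in MySQL? Or can I write a plugin to specify the indexing behavior?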

Is Nutch a good choice for my purpose? If not, can you suggest another good-quality framework or library? Thanks for your help.

Answer 1

If you are looking to scrape articles from a few websites, take a look at http://www.crawl-anywhere.com/

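It comes with an admin UI where you can specify that you want to use the boilerpipe article extractor (which is great). You can also specify, via URL pattern matching, which pages should only be crawled and which pages should be crawled and indexed.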
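
To illustrate the idea of separating "crawl only" pages from "crawl and index" pages by URL pattern, here is a minimal sketch in plain Java. It is not the Crawl Anywhere or Nutch API; the class name `UrlRoleClassifier`, the `Role` enum, and the example.com patterns are all hypothetical and would need to be adapted to the real site layout.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: decides what a crawler should do with a URL.
// None of these names come from Nutch or Crawl Anywhere.
final class UrlRoleClassifier {

    enum Role { SKIP, CRAWL_ONLY, CRAWL_AND_INDEX }

    private final List<Pattern> crawlOnly;      // e.g. article list pages
    private final List<Pattern> crawlAndIndex;  // e.g. article pages

    UrlRoleClassifier(List<Pattern> crawlOnly, List<Pattern> crawlAndIndex) {
        this.crawlOnly = crawlOnly;
        this.crawlAndIndex = crawlAndIndex;
    }

    // Assumed URL layout for an imaginary site; adjust to the real one.
    static UrlRoleClassifier example() {
        return new UrlRoleClassifier(
            Arrays.asList(Pattern.compile("https?://example\\.com/articles(\\?page=\\d+)?")),
            Arrays.asList(Pattern.compile("https?://example\\.com/articles/\\d+\\.html")));
    }

    Role classify(String url) {
        // Index patterns win over crawl-only patterns if both match.
        if (matchesAny(crawlAndIndex, url)) return Role.CRAWL_AND_INDEX;
        if (matchesAny(crawlOnly, url)) return Role.CRAWL_ONLY;
        return Role.SKIP;
    }

    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) return true; // whole URL must match
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRoleClassifier c = example();
        for (String url : Arrays.asList(
                "http://example.com/articles?page=2",
                "http://example.com/articles/123.html",
                "http://example.com/about")) {
            System.out.println(url + " -> " + c.classify(url));
        }
    }
}
```

With a rule like this, list pages are still fetched so their links can be followed, but only URLs classified as `CRAWL_AND_INDEX` would be passed on to the indexer.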

How to extend Nutch for article crawling
