Question

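I would like to know how to find the original URL after a redirect. The original URLs are all in the seed list, but I cannot tell which URL was redirected to which. In the Fetcher phase I hoped to read it from Nutch.WRITABLE_REPR_URL_KEY, but that value is overwritten with the redirected URL.

Any suggestions on how to read them from the crawldb, a segment, or the linkdb?

PS: I only crawl the first-level pages from the seed list (depth: 1).

Best, Tugcem.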

Answer 1

You can dump the outlinks with the following command:

bin/nutch readseg -dump crawl/segments/segmentname/ outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext
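
Once the dump exists, the outlinks can be pulled out of it with a simple filter. The following is only a minimal sketch: it assumes the dump was written to a file named dump inside outputdir, and that records are marked with "URL:: " lines and outlinks with "outlink: toUrl: " lines. Check this against your own output, since the layout may differ between Nutch versions. The function name list_outlinks is hypothetical.

```shell
# Print each record's URL followed by its outlink lines from a readseg dump.
# Assumption: records start with "URL:: " and outlinks appear as
# "outlink: toUrl: <url> anchor: <text>" (verify against your dump file).
list_outlinks() {
  grep -E '^URL:: |outlink: toUrl: ' "$1"
}

# Hypothetical usage, with outputdir as in the command above:
# list_outlinks outputdir/dump
```

Keeping the "URL:: " lines in the output shows which fetched page each outlink belongs to.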

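Also, to follow redirects correctly, you may want to change this property from nutch-default.xml (preferably by overriding it in nutch-site.xml rather than editing the default file directly):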

<property>
<name>http.redirect.max</name>
<value>5</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>

Nutch 1.6: finding the original URL of a redirect
