Question

I am crawling with Nutch, but it fails on URLs that contain spaces. I have gone through this link, http://lucene.472066.n3.nabble.com/URL-with-Space-td619127.html, but did not find a satisfactory answer.

It works for URLs in the seed.txt file, but not for URLs extracted from parsed page content.

I put a URL containing a space in conf/seed.txt; the space was replaced with %20 and I was able to crawl the page. I then added the following to regex-normalize.xml:

<regex> 
 <pattern> </pattern> 
 <substitution>%20</substitution> 
</regex>

I also added a reference to regex-normalize.xml in nutch-site.xml, but I am still facing the same problem.

Answer 1

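I had the same problem, but with more characters than just spaces, so I changed Fetcher.java. New URLs are added to the queue in the "feed" part of that class. You have to find this line: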

nURL.set(url.toString());

and replace it with:

nURL.set(URIUtil.encodeQuery(url.toString()));
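To get a feel for what this kind of encoding does to a URL, here is a minimal JDK-only sketch. It is not the URIUtil.encodeQuery call used above (URIUtil here is presumably the commons-httpclient class); instead it relies on the multi-argument java.net.URI constructor, which percent-encodes illegal characters such as spaces. The class name UrlSpaceEncoder and the example.com URL are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlSpaceEncoder {

    // Splits the raw URL into components, then lets java.net.URI quote
    // illegal characters (for example, a space becomes %20).
    // Note: new URL(String) is deprecated in recent JDKs but still works.
    static String encode(String raw) throws MalformedURLException, URISyntaxException {
        URL u = new URL(raw);
        URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                u.getPort(), u.getPath(), u.getQuery(), u.getRef());
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical URL, used only to show the effect.
        System.out.println(encode("http://example.com/my page.html?q=a b"));
    }
}
```

One caveat with this approach: the multi-argument constructor always quotes the percent sign, so a URL that already contains %20 would come out as %2520. It is only suitable for raw, not yet encoded, URLs.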

Answer 2

I ran into the same problem and added this to my regex-normalize.xml:

<regex> 
   <pattern>&#x20;</pattern> 
   <substitution>%20</substitution> 
</regex>
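As a rough illustration of what such a rule is meant to do, the sketch below applies the same pattern/substitution pair with java.util.regex. It assumes the normalizer applies each rule as a global replace over the URL string; the class name SpaceRuleDemo and the example.com URL are hypothetical, and this is not Nutch's own code.

```java
import java.util.regex.Pattern;

final class SpaceRuleDemo {

    // Same pair as the <regex> rule: a single space (which is what
    // &#x20; denotes in XML) replaced by %20.
    private static final Pattern SPACE = Pattern.compile(" ");
    private static final String SUBSTITUTION = "%20";

    // Replaces every match of the pattern in the URL (assumed behaviour).
    static String normalize(String url) {
        return SPACE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only to show the effect.
        System.out.println(normalize("http://example.com/my page.html"));
    }
}
```

Running a few problem URLs through a check like this is a quick way to confirm the pattern matches what you expect before re-crawling.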