All my seed URLs are rejected with Nutch 2.3

Date: 2015-07-03 06:40:15

Tags: apache web-crawler nutch

I have 84 URLs in my dmoz/urls file. When I run the command: bin/nutch inject dmoz

I get the following output:

[ec2-user@ip-172-31-47-66 local]$ bin/nutch inject dmoz/
InjectorJob: starting at 2015-07-03 02:33:41
InjectorJob: Injecting urlDir: dmoz
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 84
InjectorJob: total number of urls injected after normalization and filtering: 0
Injector: finished at 2015-07-03 02:33:44, elapsed: 00:00:03

All of the URLs were rejected. Here is a snippet of my nutch/conf/regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
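These rules are evaluated top to bottom, and the first rule whose pattern matches decides the URL's fate: '+' accepts, '-' rejects, and a URL that matches no rule is rejected. The following is a minimal Python sketch (not Nutch's actual code) approximating that behavior, with the suffix list abridged for brevity:

```python
import re

# Hypothetical re-implementation of the regex-urlfilter semantics:
# rules are tried in order; the first matching pattern decides.
RULES = [
    ('-', re.compile(r'^(file|ftp|mailto):')),
    ('-', re.compile(r'\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|exe|EXE)$')),  # abridged
    ('-', re.compile(r'[?*!@=]')),
    ('+', re.compile(r'.')),
]

def accepts(url):
    for sign, pattern in RULES:
        if pattern.search(url):       # unanchored match, like Java's Matcher.find()
            return sign == '+'
    return False                      # no rule matched: reject

print(accepts('http://example.com/'))       # True  (falls through to '+.')
print(accepts('http://example.com/a?q=1'))  # False (contains '?')
print(accepts('ftp://example.com/file'))    # False (ftp: scheme)
print(accepts('http://example.com/x.png'))  # False (image suffix)
```

If every one of the 84 seeds is rejected, either the file in effect is not the one being edited (see the answers below about the runtime/local copy) or an earlier '-' rule matches all of them.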
Below is the hadoop.log output from the same run:

2015-07-03 02:33:41,095 INFO  crawl.InjectorJob - InjectorJob: starting at 2015-07-03 02:33:41
2015-07-03 02:33:41,096 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: dmoz
2015-07-03 02:33:43,301 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
2015-07-03 02:33:43,329 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-03 02:33:43,389 WARN  snappy.LoadSnappy - Snappy native library not loaded
2015-07-03 02:33:44,278 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2015-07-03 02:33:44,430 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2015-07-03 02:33:44,768 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 84
2015-07-03 02:33:44,768 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
2015-07-03 02:33:44,769 INFO  crawl.InjectorJob - Injector: finished at 2015-07-03 02:33:44, elapsed: 00:00:03

I would really appreciate it if someone could help me solve this. Basically, all of my URLs are being rejected and I don't know why.

Thanks, -Hadi

2 Answers:

Answer 0 (score: 1)

If you are using the /local runtime environment, you do not need to recompile for every change to the conf/ files.

After you build Nutch's runtime (with "ant runtime"), the build creates the /local environment under $NUTCH_HOME/runtime/local. Beneath it there is a conf/ directory, which is essentially a copy of $NUTCH_HOME/conf. You can (and should) edit the files there to change the /local configuration.

So if, for example, you want to change your crawler's name, modify $NUTCH_HOME/runtime/local/conf/nutch-site.xml and add/edit the http.agent.name property to whatever name you want.
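A minimal nutch-site.xml fragment for the property mentioned above (the agent name value here is just a placeholder; pick your own):

```xml
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
```

This property must be set to a non-empty value before most Nutch fetch/crawl commands will run.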

Answer 1 (score: 0)

Well, after spending a lot of time trying to figure this out... because I had changed conf/regex-urlfilter.txt, I had to rebuild Nutch with "ant runtime". Things finally worked, so my conclusion and the lesson from the past two days is: always rebuild Nutch after changing its configuration.