Nutch 2.0 fetches pages repeatedly when a job fails

Time: 2012-08-27 08:06:36

Tags: apache web-crawler nutch

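I am using Nutch with MySQL as the storage backend.

The job fails when crawling certain sites. I get the following exception, and Nutch exits when it reaches this page: http://www.appchina.com/users.html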

Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
    at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

So I modified ./src/java/org/apache/nutch/util/NutchJob.java, changing
    if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
to
    if (getConfiguration().getBoolean("fail.on.job.failure", false)) {

After recompiling, I no longer get any exception, but the crawl restarts over and over without end.

FetcherJob : timelimit set for : -1
FetcherJob: threads: 30
FetcherJob: parsing: false
FetcherJob: resuming: false
Using queue mode : byHost
Fetcher: threads: 30
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.appchina.com/
fetching http://www.appchina.com/users.html
-finishing thread FetcherThread0, activeThreads=29
-finishing thread FetcherThread29, activeThreads=28
...
0/0 spinwaiting/active, 2 pages, 0 errors, 0.4 0.4 pages/s, 137 137 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all
Parsing http://www.appchina.com/
Parsing http://www.appchina.com/users.html

UPDATE: errors from hadoop.log

2012-09-17 18:48:51,257 WARN  mapred.LocalJobRunner - job_local_0004
java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
        at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
        at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
        at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
        ... 6 more
Caused by: java.sql.SQLException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
        at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
        ... 8 more

Another update

I dropped the table that Gora had created and created a similar table with a VARCHAR(128) id and DEFAULT CHARSET utf8mb4. It works now. Why?

Can anyone help?
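One plausible explanation can be checked from the error message itself. The bytes MySQL rejected, \xE7\x94\xA8\xE6\x88\xB7, are valid UTF-8 for two Chinese characters, so the page text is fine and the 'text' column probably could not represent it. The sketch below is only an illustration under assumptions: the post does not say which charset the Gora-created table used, Java's ISO-8859-1 stands in for a single-byte column charset such as MySQL's latin1 (the two are not identical), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch, Gora, or the MySQL driver.
final class IncorrectStringValueDemo {

    // The bytes quoted in the MySQL error: \xE7\x94\xA8\xE6\x88\xB7
    static final byte[] REJECTED_BYTES = {
        (byte) 0xE7, (byte) 0x94, (byte) 0xA8,
        (byte) 0xE6, (byte) 0x88, (byte) 0xB7
    };

    // Interpret the rejected bytes as UTF-8 text.
    static String decodeRejectedBytes() {
        return new String(REJECTED_BYTES, StandardCharsets.UTF_8);
    }

    // True if every character of the text can be encoded in the given charset.
    // ASSUMPTION: this approximates whether a column with that charset accepts it.
    static boolean canStore(String text, Charset columnCharset) {
        return columnCharset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = decodeRejectedBytes();
        System.out.println("decoded text: " + text);
        System.out.println("fits ISO-8859-1: "
                + canStore(text, StandardCharsets.ISO_8859_1));
        System.out.println("fits UTF-8: "
                + canStore(text, StandardCharsets.UTF_8));
    }
}
```

The single-byte check comes back false and the UTF-8 check comes back true. If the original table did use a single-byte default charset, that would explain why recreating it with utf8mb4 made the parse job's write succeed. Each of these characters is three bytes in UTF-8, so the four-byte range that utf8mb4 adds is not what these particular bytes needed.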

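1 answer:

Answer 0 (score: 0)

You need to add the Hadoop log for the parse job. The attached stack trace does not show that information. Did parsing succeed after you made that code change?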