I added a set of seed URLs to crawl using this command:
./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4
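In the first iteration, all of the commands (inject, generate, fetch, parse, update-table, indexer, and delete duplicates) executed successfully.
In the second iteration, the "update-table" command failed (see the error log below for reference), and because this command failed, the whole process terminated.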
CrawlDB update for 1
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1452969522-27478 -crawlId 1
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at 2016-01-17 02:10:17
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId: 1452969522-27478
16/01/17 02:10:17 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar3649584948711945520/classes/plugins
16/01/17 02:10:18 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered Plugins:
16/01/17 02:10:18 INFO plugin.PluginRepository: Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
16/01/17 02:10:18 INFO plugin.PluginRepository: HTTP Framework (lib-http)
16/01/17 02:10:18 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
16/01/17 02:10:18 INFO plugin.PluginRepository: MetaTags (parse-metatags)
16/01/17 02:10:18 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient)
16/01/17 02:10:18 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
16/01/17 02:10:18 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
16/01/17 02:10:18 INFO plugin.PluginRepository: XML Libraries (lib-xml)
16/01/17 02:10:18 INFO plugin.PluginRepository: JavaScript Parser (parse-js)
16/01/17 02:10:18 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
16/01/17 02:10:18 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
16/01/17 02:10:18 INFO plugin.PluginRepository: Top Level Domain Plugin (tld)
16/01/17 02:10:18 INFO plugin.PluginRepository: Language Identification Parser/Filter (language-identifier)
16/01/17 02:10:18 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Metadata Indexing Filter (index-metadata)
16/01/17 02:10:18 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
16/01/17 02:10:18 INFO plugin.PluginRepository: Subcollection indexing and query filter (subcollection)
16/01/17 02:10:18 INFO plugin.PluginRepository: Link Analysis Scoring Plug-in (scoring-link)
16/01/17 02:10:18 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
16/01/17 02:10:18 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
16/01/17 02:10:18 INFO plugin.PluginRepository: More Indexing Filter (index-more)
16/01/17 02:10:18 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
16/01/17 02:10:18 INFO plugin.PluginRepository: SOLRIndexWriter (indexer-solr)
16/01/17 02:10:18 INFO plugin.PluginRepository: Creative Commons Plugins (creativecommons)
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/17 02:10:18 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Index Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/01/17 02:10:19 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x60a2630a connecting to ZooKeeper ensemble=localhost:2181
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:host.name=cism479
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_65
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: number of splits:2
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1452929501009_0024
16/01/17 02:10:28 INFO impl.YarnClientImpl: Submitted application application_1452929501009_0024
16/01/17 02:10:28 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1452929501009_0024/
16/01/17 02:10:28 INFO mapreduce.Job: Running job: job_1452929501009_0024
16/01/17 02:10:39 INFO mapreduce.Job: Job job_1452929501009_0024 running in uber mode : false
16/01/17 02:10:39 INFO mapreduce.Job: map 0% reduce 0%
16/01/17 02:11:37 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_0, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
at java.net.URL.<init>(URL.java:620)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
at java.net.URL.<init>(URL.java:615)
... 13 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
16/01/17 02:12:13 INFO mapreduce.Job: map 33% reduce 0%
16/01/17 02:12:24 INFO mapreduce.Job: map 50% reduce 0%
16/01/17 02:12:44 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_1, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
at java.net.URL.<init>(URL.java:620)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
at java.net.URL.<init>(URL.java:615)
... 13 more
16/01/17 02:13:19 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_2, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
at java.net.URL.<init>(URL.java:620)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
at java.net.URL.<init>(URL.java:615)
... 13 more
16/01/17 02:13:42 INFO mapreduce.Job: map 100% reduce 100%
16/01/17 02:13:43 INFO mapreduce.Job: Job job_1452929501009_0024 failed with state FAILED due to: Task failed task_1452929501009_0024_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/01/17 02:13:44 INFO mapreduce.Job: Counters: 34
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=49949067
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1193
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Launched map tasks=5
Other local map tasks=3
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=829677
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=276559
Total vcore-seconds taken by all map tasks=276559
Total megabyte-seconds taken by all map tasks=849589248
Map-Reduce Framework
Map input records=30201
Map output records=1164348
Map output bytes=250659088
Map output materialized bytes=49832245
Input split bytes=1193
Combine input records=0
Spilled Records=1164348
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=3541
CPU time spent (ms)=42980
Physical memory (bytes) snapshot=2062766080
Virtual memory (bytes) snapshot=5086490624
Total committed heap usage (bytes)=2127036416
File Input Format Counters
Bytes Read=0
Exception in thread "main" java.lang.RuntimeException: job failed: name=[1]update-table, jobid=job_1452929501009_0024
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Error running:
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1452969522-27478 -crawlId 1
Failed with exit value 1.
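The error is clearly caused by a malformed URL. Is there a way to get rid of such malformed URLs? Or is there a solution that skips or bypasses them so that the subsequent steps can still run? Please advise.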
Answer 0 (score: 0)
To skip URLs of this kind (malformed URLs), you should create a Nutch filter in the file conf/regex-urlfilter.txt.
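Rules in regex-urlfilter.txt are lines that start with `+` (accept) or `-` (reject) followed by a regular expression, and the first matching rule decides. One rule that might catch the URL from the log is `-[\s<>"]`, placed above the final `+.` line, since the offending string contains a space, an angle bracket, and a double quote. That pattern is only a guess based on the error message, not a tested fix. Note also that the plugin list in the log shows lib-regex-filter but not urlfilter-regex, so the filter plugin may need to be enabled in plugin.includes before any rule takes effect. Before editing the file, the pattern can be tried against sample URLs with a small standalone check such as the sketch below. The class name `UrlPrecheck`, the method `isAcceptable`, and the sample URLs are hypothetical and are not part of Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

public class UrlPrecheck {
    // Assumed reject pattern: whitespace, angle brackets, or double quotes
    // anywhere in the URL. With a leading '-', the same expression could be
    // tried as a rule in conf/regex-urlfilter.txt.
    static final Pattern REJECT = Pattern.compile("[\\s<>\"]");

    // Returns true only if the URL passes the reject pattern and can be
    // parsed by java.net.URL, the class that throws in the stack trace.
    static boolean isAcceptable(String url) {
        if (REJECT.matcher(url).find()) {
            return false;
        }
        try {
            new URL(url);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up samples; the third imitates the fragment seen in the log.
        String[] samples = {
            "https://example.com/page",
            "http://example.com:notaport/",
            "http://example.com:&#10;from <a href=\"https://example.org/"
        };
        for (String s : samples) {
            System.out.println(isAcceptable(s) + "  " + s);
        }
    }
}
```

The second sample has no forbidden characters but a non-numeric port, which is the same kind of failure the trace reports (Integer.parseInt called from URLStreamHandler.parseURL), so the regex alone would not catch it. If URLs like that appear in your crawl, the rule would need to be extended.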