Question

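I am running Nutch 2.3 on EMR (AMI version 2.4.2). The crawl steps work fine in both local and distributed mode (hadoop jar apache-nutch-2.3.job <MainClass> <args>), and I can invoke these steps by launching the REST service in local mode. However, when I run the REST service in distributed mode (hadoop jar apache-nutch-2.3.job org.apache.nutch.api.NutchServer), the server receives the call but never completes the task. What is the right way to run the Nutch REST service in distributed mode?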

Info

When InjectorJob is run from the command line in distributed mode, the output is as follows:


COMMAND:
hadoop jar ./apache-nutch-2.3.job org.apache.nutch.crawl.InjectorJob s3://myemrbucket/urls -crawlId 2

15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: starting at 2015-11-19 09:55:06
15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: s3://myemrbucket/urls
15/11/19 09:55:06 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
15/11/19 09:55:08 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '2_webpage'Assuming they are the same.
15/11/19 09:55:08 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/19 09:55:08 INFO mapred.JobClient: Default number of map tasks: null
15/11/19 09:55:08 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
15/11/19 09:55:08 INFO mapred.JobClient: Default number of reduce tasks: 0
15/11/19 09:55:10 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
15/11/19 09:55:10 INFO mapred.JobClient: Setting group to hadoop
15/11/19 09:55:10 INFO input.FileInputFormat: Total input paths to process : 1
15/11/19 09:55:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
15/11/19 09:55:10 WARN lzo.LzoCodec: Could not find build properties file with revision hash
15/11/19 09:55:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
15/11/19 09:55:10 WARN snappy.LoadSnappy: Snappy native library is available
15/11/19 09:55:10 INFO snappy.LoadSnappy: Snappy native library loaded
15/11/19 09:55:10 INFO mapred.JobClient: Running job: job_201511182052_0037
15/11/19 09:55:11 INFO mapred.JobClient:  map 0% reduce 0%
15/11/19 09:55:38 INFO mapred.JobClient:  map 100% reduce 0%
15/11/19 09:55:43 INFO mapred.JobClient: Job complete: job_201511182052_0037
15/11/19 09:55:43 INFO mapred.JobClient: Counters: 20
15/11/19 09:55:43 INFO mapred.JobClient:   Job Counters 
15/11/19 09:55:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16424
15/11/19 09:55:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/11/19 09:55:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/11/19 09:55:43 INFO mapred.JobClient:     Rack-local map tasks=1
15/11/19 09:55:43 INFO mapred.JobClient:     Launched map tasks=1
15/11/19 09:55:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/11/19 09:55:43 INFO mapred.JobClient:   File Output Format Counters 
15/11/19 09:55:43 INFO mapred.JobClient:     Bytes Written=0
15/11/19 09:55:43 INFO mapred.JobClient:   injector
15/11/19 09:55:43 INFO mapred.JobClient:     urls_injected=1
15/11/19 09:55:43 INFO mapred.JobClient:   FileSystemCounters
15/11/19 09:55:43 INFO mapred.JobClient:     HDFS_BYTES_READ=98
15/11/19 09:55:43 INFO mapred.JobClient:     S3_BYTES_READ=61
15/11/19 09:55:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=36254
15/11/19 09:55:43 INFO mapred.JobClient:   File Input Format Counters 
15/11/19 09:55:43 INFO mapred.JobClient:     Bytes Read=61
15/11/19 09:55:43 INFO mapred.JobClient:   Map-Reduce Framework
15/11/19 09:55:43 INFO mapred.JobClient:     Map input records=1
15/11/19 09:55:43 INFO mapred.JobClient:     Physical memory (bytes) snapshot=193712128
15/11/19 09:55:43 INFO mapred.JobClient:     Spilled Records=0
15/11/19 09:55:43 INFO mapred.JobClient:     CPU time spent (ms)=3960
15/11/19 09:55:43 INFO mapred.JobClient:     Total committed heap usage (bytes)=298319872
15/11/19 09:55:43 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1525059584
15/11/19 09:55:43 INFO mapred.JobClient:     Map output records=1
15/11/19 09:55:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=98
15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 1
15/11/19 09:55:44 INFO crawl.InjectorJob: Injector: finished at 2015-11-19 09:55:44, elapsed: 00:00:38

When the same job is invoked via REST, it gets stuck after emitting the following output:


POST ARGS:

    {
      "crawlId":"11",
      "confId":"default",
      "type":"INJECT",
      "args":{"seedDir":"s3://myemrbucket/urls"}
    }
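For reference, the request above can be reproduced from a script. This is a minimal sketch, not part of the original setup: the `/job/create` path and port 8081 are taken from the server log in this question, while the host `localhost` and the helper names `build_inject_request` and `submit` are assumptions made for illustration.

```python
import json
import urllib.request


def build_inject_request(base_url, crawl_id, seed_dir, conf_id="default"):
    """Build (but do not send) a POST request with the same JSON body as above."""
    payload = {
        "crawlId": crawl_id,
        "confId": conf_id,
        "type": "INJECT",
        "args": {"seedDir": seed_dir},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request, timeout=30):
    """Send the request and return (HTTP status, response body as text)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


# Example (hypothetical host; requires a running NutchServer):
# req = build_inject_request("http://localhost:8081", "11", "s3://myemrbucket/urls")
# print(submit(req))
```

Keeping request construction separate from sending makes it possible to inspect the exact body before it reaches the server.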

15/11/19 09:46:14 INFO api.NutchServer: Starting NutchServer on port: 8081 with logging level: INFO ...
Nov 19, 2015 9:46:14 AM org.restlet.engine.connector.NetServerHelper start
INFO: Starting the internal [HTTP/1.1] server on port 8081
15/11/19 09:46:14 INFO api.NutchServer: Started NutchServer on port 8081
Nov 19, 2015 9:46:25 AM org.restlet.engine.log.LogFilter afterHandle
INFO: 2015-11-19    09:46:25    1xx.xx.x.xx -   -   8081    POST    /job/create -   200 28  110 498 http://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8081   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36-
15/11/19 09:46:25 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
15/11/19 09:46:27 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '11_webpage'Assuming they are the same.
15/11/19 09:46:28 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/19 09:46:28 INFO mapred.JobClient: Default number of map tasks: null
15/11/19 09:46:28 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
15/11/19 09:46:28 INFO mapred.JobClient: Default number of reduce tasks: 0
15/11/19 09:46:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

After that, it does not proceed any further.
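To pin down where the REST run stops relative to the command-line run, the two logs can be compared by logger name. The following is a rough diagnostic sketch, assuming both logs are available as lists of text lines; the function names `parse` and `first_missing_step` are hypothetical, and the regular expression only covers the `yy/MM/dd HH:mm:ss LEVEL logger: message` lines shown in this question.

```python
import re

# Matches Hadoop-style log lines such as
# "15/11/19 09:55:08 INFO mapred.JobClient: Default number of reduce tasks: 0"
LOG_LINE = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (\w+) ([\w.]+): (.*)$"
)


def parse(lines):
    """Return (level, logger, message) tuples; lines in other formats are skipped."""
    parsed = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            parsed.append(match.groups())
    return parsed


def first_missing_step(cli_lines, rest_lines):
    """Return the first CLI log entry whose logger never appears in the REST log."""
    rest_loggers = {logger for _, logger, _ in parse(rest_lines)}
    for level, logger, message in parse(cli_lines):
        if logger not in rest_loggers:
            return level, logger, message
    return None
```

Applied to the two logs in this question, the first command-line entry with no counterpart in the REST log is the `security.ShellBasedUnixGroupsMapping` line at 09:55:10, which precedes `Running job: ...`. In other words, the REST run appears to stall before the job is submitted to the cluster.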

Nutch REST not working in distributed mode on EMR


0 answers: