How do I deploy a Nutch job on Google Cloud Dataproc?

Asked: 2017-07-26 00:16:54

Tags: hadoop nutch google-cloud-dataproc

I have been trying to deploy a Nutch job (with custom plugins) on my Google Cloud Dataproc cluster, but I keep running into errors (configuration errors, I suspect).

I need clear, step-by-step instructions on how to do this. The guide should cover how to set up permissions and access files both in the GCS bucket and on the local file system (Windows 7).

I tried this configuration, without success:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/nutch
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

I have also tried:

Region: global 
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: org.apache.nutch.crawl.Crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
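
For completeness, I believe the third configuration maps to roughly this gcloud CLI invocation (a sketch on my part, not a command I have verified; line continuations are bash-style, so on Windows cmd.exe they would be ^ instead of \):

gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Crawl \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt -depth 4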

Follow-up: I have made some progress, but now I am getting this error:

17/07/28 18:59:11 INFO crawl.Injector: Injector: starting at 2017-07-28 18:59:11
17/07/28 18:59:11 INFO crawl.Injector: Injector: crawlDb: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
17/07/28 18:59:11 INFO crawl.Injector: Injector: urlDir: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb
17/07/28 18:59:11 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
17/07/28 18:59:11 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
17/07/28 18:59:11 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/07/28 18:59:11 ERROR crawl.Injector: Injector: java.lang.IllegalArgumentException: Wrong FS: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls, expected: hdfs://first-cluster-m
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:298)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)

I understand this has something to do with the file system. How do I correctly reference the GCS file system and Hadoop files?

Follow-up: I made some progress with this configuration:

{ "reference": { "projectId": "ageless-valor-174413", "jobId": "108a7d43-671a-4f61-8ba8-b87010a8a823" }, "placement": { "clusterName": "first-cluster", "clusterUuid": "f3795563-bd44-4896-bec7-0eb81a3f685a" }, "status": { "state": "ERROR", "details": "Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput'.", "stateStartTime": "2017-07-28T18:59:13.518Z" }, "statusHistory": [ { "state": "PENDING", "stateStartTime": "2017-07-28T18:58:57.660Z" }, { "state": "SETUP_DONE", "stateStartTime": "2017-07-28T18:59:00.811Z" }, { "state": "RUNNING", "stateStartTime": "2017-07-28T18:59:02.347Z" } ], "driverOutputResourceUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput", "driverControlFilesUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/", "hadoopJob": { "mainClass": "org.apache.nutch.crawl.Injector", "args": [ "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/", "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb/" ], "jarFileUris": [ "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job" ], "loggingConfig": {} } }

But I still get the same Wrong FS error and stack trace shown above.


1 Answer:

Answer 0 (score: 0)

You can resolve this by referring to the Google Cloud Storage files with the correct scheme (gs), and by changing the default file system to Google Cloud Storage.
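
To see why the scheme matters, compare the three kinds of paths involved here. The sketch below assumes it is run from a shell on one of the cluster's nodes; the deploy/urls path is taken from the job configuration above:

# 1. A Cloud Console *browser* URL: a web page for humans, not a file
#    system URI, which is why Hadoop rejects it with "Wrong FS":
#    https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
# 2. The same object addressed with the gs:// scheme, which Hadoop accepts:
hadoop fs -ls gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
# 3. A path with no scheme at all: resolved against the default file
#    system, which on this cluster is hdfs://first-cluster-m (the
#    "expected:" value in the error message):
hadoop fs -ls /user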

Step 1:

Replace

https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls

with

gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls

(and do the same for the crawlDb argument).
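
With the gs:// scheme in place, the Injector submission above would look roughly like this from the gcloud CLI (a sketch; note that Nutch 1.x's Injector expects its arguments in the order <crawldb> <url_dir>, and the log output above suggests the two paths were also passed in swapped order, so double-check that):

gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Injector \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb \
       gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls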

Step 2:

Add the following property to your nutch-site.xml file:

<property>
  <name>fs.defaultFS</name>
  <value>gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia</value>
  <description>The name of the default file system. A URI whose scheme
  and authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.</description>
</property>
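
A Dataproc cluster ships with the GCS connector pre-installed, so the gs scheme is already registered in the cluster's own core-site.xml. If you ever run the same job jar outside Dataproc, you would, as far as I know, also need to register the connector's FileSystem class yourself, along these lines (an extra step a Dataproc cluster does not need; the connector jar must be on the classpath):

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem implementation class for gs:// URIs,
  provided by the Google Cloud Storage connector jar.</description>
</property>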

(In older versions of Hadoop this property was named "fs.default.name".)
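
Keep in mind that nutch-site.xml gets packaged into the .job file, so the change only takes effect after rebuilding and re-uploading it. Roughly, assuming a standard Nutch 1.12 source checkout (your build layout may differ):

# Rebuild the job file after editing conf/nutch-site.xml:
ant runtime
# Re-upload the rebuilt job file to the bucket the Dataproc job points at:
gsutil cp runtime/deploy/apache-nutch-1.12-SNAPSHOT.job gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/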