Question

I followed this guide to read data stored in a Google Cloud Storage bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector It worked fine. The following command

hadoop fs -ls gs://the-bucket-you-want-to-list

gave me the expected results. But when I try to read the data with pyspark using

rdd = sc.textFile("gs://crawl_tld_bucket/")

it throws the following error:

`

py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
`

How can I get this working?

Answer 1

To access Google Cloud Storage, you have to include the Cloud Storage connector:

spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py

or

pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
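If you would rather set this up from inside the script than on the command line, something like the sketch below may work. It assumes the same placeholder jar path as above, and it assumes the connector's usual class names (GoogleHadoopFileSystem and GoogleHadoopFS); the helper names gcs_spark_options and build_session are made up for illustration. Authentication settings are not covered here and may also be needed depending on your environment.

```python
# Sketch only: register the GCS connector through Spark configuration
# instead of the --jars command-line flag.
# The jar path is the placeholder from the answer, not a real location.

GCS_CONNECTOR_JAR = "/path/to/gcs/gcs-connector-latest-hadoop2.jar"


def gcs_spark_options(jar_path=GCS_CONNECTOR_JAR):
    """Return Spark options that put the connector on the classpath and bind gs://."""
    return {
        # Same setting that --jars populates on the command line.
        "spark.jars": jar_path,
        # Map the gs:// scheme to the connector's Hadoop FileSystem classes.
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    }


def build_session(app_name="gcs-read-example", jar_path=GCS_CONNECTOR_JAR):
    """Create a SparkSession with the connector configured (requires pyspark)."""
    # Imported lazily so gcs_spark_options() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in gcs_spark_options(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

You would then call spark = build_session() and read the bucket with spark.sparkContext.textFile("gs://crawl_tld_bucket/"). Note that spark.jars is only picked up when the session is created in a fresh Python process; in an already running pyspark shell the JVM has started, so the --jars flag is the safer route there.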

Reading Google bucket data in Spark
