pyspark error :: java.io.IOException: No FileSystem for scheme: gs

Date: 2019-04-09 14:31:16

Tags: google-cloud-platform pyspark

I'm trying to read JSON files from a Google Storage bucket into a pyspark dataframe on my local Spark machine.

Here is the code. It lists the files from the bucket fine (I can see the printout from blob.name), but the spark.read.json call then crashes with java.io.IOException: No FileSystem for scheme: gs:

import pandas as pd
import numpy as np

from google.cloud import storage  # needed for storage.Client below
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext

conf = SparkConf().setAll([('spark.executor.memory', '16g'),
                           ('spark.executor.cores', '4'),
                           ('spark.cores.max', '4')]).setMaster('local[*]')

spark = (SparkSession
         .builder
         .config(conf=conf)
         .getOrCreate())

sc = spark.sparkContext

import glob
import bz2
import json
import pickle

bucket_path = "gs://<SOME_PATH>/"
client = storage.Client(project='<SOME_PROJECT>')
bucket = client.get_bucket('<SOME_PATH>')
blobs = bucket.list_blobs()

theframes = []

for blob in blobs:
    print(blob.name)
    testspark = spark.read.json(bucket_path + blob.name).cache()
    theframes.append(testspark)

I've seen this type of error discussed on Stack Overflow, but most of the solutions seem to be in Scala rather than pyspark, and/or involve editing core-site.xml, which I haven't done.

I'm using Spark 2.4.1 and Python 3.6.7.

Any help would be greatly appreciated!

1 Answer:

Answer 0 (score: 1)

Some configuration parameters are required for "gs" to be recognized as a distributed filesystem.

Use this setting to register the Google Cloud Storage connector jar, gcs-connector-hadoop2-latest.jar:

spark = SparkSession\
        .builder\
        .config("spark.driver.maxResultSize", "40g") \
        .config('spark.sql.shuffle.partitions', '2001') \
        .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")\
        .getOrCreate()
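
If you don't already have the connector jar locally, Google publishes it at a public URL (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar). As an alternative sketch, the connector can also be resolved from Maven at session start through spark.jars.packages; the coordinate and version below are illustrative, not a recommendation:

from pyspark.sql import SparkSession

# Sketch: pull the GCS connector from Maven instead of pointing at a local jar.
# Pick a version that matches your Hadoop line; hadoop2-1.9.17 is only an example.
spark = SparkSession\
        .builder\
        .config("spark.jars.packages",
                "com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.17")\
        .getOrCreate()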

Other configurations that can be set from pyspark:

spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# Required if you are authenticating with a service account: enable it and point to the JSON keyfile
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# The following are required instead if you are using OAuth client credentials
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
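
With the connector jar on the classpath and the settings above applied, the gs:// path from the question should now resolve through GoogleHadoopFileSystem. A minimal usage sketch; the file name is a placeholder, and <SOME_PATH> is the question's own placeholder:

# Quick check that the 'gs' scheme is now recognized.
df = spark.read.json("gs://<SOME_PATH>/some_file.json")
df.printSchema()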

Alternatively, you can set these configurations in core-site.xml or spark-defaults.conf.
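
For spark-defaults.conf specifically, Spark copies any property prefixed with spark.hadoop.* into the Hadoop Configuration at startup, so the same settings can be written as plain entries. A sketch, assuming service-account authentication and placeholder paths:

spark.jars                                                    /path/to/gcs-connector-hadoop2-latest.jar
spark.hadoop.fs.gs.impl                                       com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.auth.service.account.enable                true
spark.hadoop.google.cloud.auth.service.account.json.keyfile   /path/to/keyfile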