Reading from Google Cloud Storage in Dataproc

Asked: 2018-08-08 16:51:38

Tags: google-cloud-platform google-cloud-storage google-cloud-dataproc

I'm trying to read a csv or txt file from GCS in a Dataproc pyspark application. I've tried many things; the most promising so far:

#!/usr/bin/python 
import os
import sys
import pyspark
from pyspark.sql import SQLContext
import pandas as pd
from pyspark import SparkContext, SparkConf

# Point Python at the Dataproc Spark installation
os.environ['SPARK_HOME'] = "/usr/lib/spark/"
sys.path.append("/usr/lib/spark/python/")

sc = SparkContext()
sql_sc = SQLContext(sc)

Then:

pandas_df = pd.read_csv('{BUCKET}/user2user_relations.csv')
s_df = sql_sc.createDataFrame(pandas_df)

data = sc.textFile('gs://{BUCKET}/user2user_relations.csv')

The pandas approach doesn't have to be the one that works; I want to end up with an RDD to feed Spark's ALS recommender. The error I get is:

Job [fdcad5bcf77343e2b8782097cd7450cb] submitted.
Waiting for job output...
18/08/08 16:50:20 INFO org.spark_project.jetty.util.log: Logging initialized @2662ms
18/08/08 16:50:20 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
18/08/08 16:50:20 INFO org.spark_project.jetty.server.Server: Started @2751ms
18/08/08 16:50:20 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@41335d47{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/08/08 16:50:21 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.7-hadoop2
18/08/08 16:50:21 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at new-try-gcs-pd-m/10.164.0.2:8032
18/08/08 16:50:24 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1533726146765_0018
dataproc-d3d8d55c-05b3-4211-adf2-2014ebdbc20c-europe-west4
go-de-internal
gs://dataproc-d3d8d55c-05b3-4211-adf2-2014ebdbc20c-europe-west4/
gs://dataproc-d3d8d55c-05b3-4211-adf2-2014ebdbc20c-europe-west4/user2user_relations.csv
Traceback (most recent call last):
  File "/tmp/fdcad5bcf77343e2b8782097cd7450cb/pyspark_gcs_acess.py", line 19, in <module>
    pandas_df = pd.read_csv(input_file)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 452, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 234, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 542, in __init__
    self._make_engine(self.engine)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 679, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1041, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 332, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3218)
  File "parser.pyx", line 559, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5594)
IOError: File gs://{}/user2user_relations.csv does not exist
18/08/08 16:50:28 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@41335d47{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [fdcad5bcf77343e2b8782097cd7450cb] entered state [ERROR] while waiting for [DONE].

Thanks

2 Answers:

Answer 0 (score: 0)

Pandas won't read directly from GCS with the URI you provided, unlike Spark on Dataproc, which ships with the GCS connector installed by default.

If you want to read the same blob from pandas, I'd suggest:

import io

import pandas as pd
from google.cloud import storage

client = storage.Client()
# https://console.cloud.google.com/storage/browser/[bucket-id]/
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')

# download_as_string() returns the file contents as bytes, so wrap
# them in a file-like object before handing them to pandas
df = pd.read_csv(io.BytesIO(blob.download_as_string()))
  • Alternatively, first copy the file to a local directory with the command gsutil cp gs://bucket/blob /local/folder/blob, then read it locally:

df = pd.read_csv('/local/folder/blob')
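
Since Spark on Dataproc already ships with the GCS connector, you can also skip pandas entirely and have Spark read the file straight from GCS. A minimal sketch (the bucket name is a placeholder):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_sc = SQLContext(sc)

# The GCS connector resolves gs:// paths natively on Dataproc,
# so textFile needs no extra setup
rdd = sc.textFile('gs://your-bucket/user2user_relations.csv')

# Or read straight into a Spark DataFrame without going through pandas
df = sql_sc.read.csv('gs://your-bucket/user2user_relations.csv', header=True)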

Hope this helps.

Answer 1 (score: 0)

You're missing the BUCKET value. Just replace {BUCKET} with your bucket name, or set the variable:

data = sc.textFile('gs://{BUCKET}/user2user_relations.csv')
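
Since the goal is an RDD to feed Spark's ALS recommender, here is a minimal sketch of going from that textFile RDD to a trained model, assuming each CSV line is "user,item,rating" (the column layout and bucket name are assumptions, not from the question):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext()

# Assumes lines of the form "user,item,rating"; if the file has a
# header row, filter it out before parsing
data = sc.textFile('gs://your-bucket/user2user_relations.csv')
ratings = data.map(lambda line: line.split(',')) \
              .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

# Train a small ALS model on the parsed ratings
model = ALS.train(ratings, rank=10, iterations=10)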