Loading a CSV from S3 into PySpark

Time: 2016-02-24 21:26:41

Tags: python csv amazon-s3 pyspark

With PySpark I can load a local CSV using the following code:

cd ./spark-1.6.0-bin-hadoop2.4/

./bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0 --driver-memory 4G

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('/my/local/folder/input_data.csv') \
    .write.save("/my/local/folder/input_data", format="parquet")

But I can't load a (non-public) CSV stored on S3, because it times out:

sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv') \
    .write.save("/my/local/folder/input_data", format="parquet")


py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Operation timed out

Is this possible at all, and if so, any idea what I'm doing wrong? Thanks in advance.
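For reference, a common alternative to embedding the credentials in the s3n:// URL is to set them on the Hadoop configuration of the existing SparkContext. This is only a sketch, not part of the original question, and assumes the standard fs.s3n.* property names and the sc/sqlContext objects from the pyspark shell shown above:

# Sketch (assumption, not from the original post): pass S3 credentials via the
# Hadoop configuration instead of the s3n:// URL, then load the CSV as before.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<AWS_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<AWS_SECRET_ACCESS_KEY>")

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('s3n://my.bucket/folder/input_data.csv')
df.write.save("/my/local/folder/input_data", format="parquet")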

1 Answer:

Answer 0 (score: -2)

Have you tried loading the RDD from plain text? Splitting a CSV file is quite simple:

sc.textFile('/my/local/folder/input_data.csv').map(lambda row:row.split(','))
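Building on that suggestion, a rough sketch of the full path the question asks for (read as plain text, split, drop the header, write Parquet). This assumes S3 credentials are already configured (for example via the Hadoop configuration shown earlier), and the column names here are illustrative placeholders, not from the answer:

# Sketch: plain-text read from S3, manual CSV split, then Parquet output.
lines = sc.textFile('s3n://my.bucket/folder/input_data.csv')

# Drop the header line (assumes the header only occurs once in the file).
header = lines.first()
rows = lines.filter(lambda line: line != header) \
            .map(lambda line: line.split(','))

# Build a DataFrame from the split rows; replace the placeholder column names
# with the real schema (all values are strings at this point).
df = sqlContext.createDataFrame(rows, ['col1', 'col2', 'col3'])
df.write.save("/my/local/folder/input_data", format="parquet")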