With PySpark I can load a local CSV file using the following code:
cd ./spark-1.6.0-bin-hadoop2.4/
./bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0 --driver-memory 4G
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('/my/local/folder/input_data.csv').write.save("/my/local/folder/input_data", format="parquet")
But I can't do the same with a (non-public) CSV file stored on S3, because the request times out:
sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv').write.save("/my/local/folder/input_data", format="parquet")
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Operation timed out
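For what it's worth, here is a minimal sketch of the other way I know of to supply the AWS keys, through the SparkContext's Hadoop configuration instead of embedding them in the URL (the property names assume the s3n filesystem; embedding keys in the URL is also known to break when the secret key contains '/' characters):

# assumption: set the s3n credentials on the Hadoop configuration instead of the URL
sc._jsc.hadoopConfiguration().set('fs.s3n.awsAccessKeyId', '<AWS_ACCESS_KEY_ID>')
sc._jsc.hadoopConfiguration().set('fs.s3n.awsSecretAccessKey', '<AWS_SECRET_ACCESS_KEY>')
# with the keys configured above, the bucket path no longer needs them embedded
sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('s3n://my.bucket/folder/input_data.csv') \
    .write.save('/my/local/folder/input_data', format='parquet')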
Is this even possible, and if so, any idea what I'm doing wrong? Thanks in advance.
Answer 0 (score: -2)
Have you tried loading the RDD from plain text? Splitting a CSV file is quite simple:
sc.textFile('/my/local/folder/input_data.csv').map(lambda row: row.split(','))
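A rough sketch of carrying that suggestion through to the same Parquet output as in the question (it assumes the sqlContext from the question exists, that the first line is a header, and uses a naive comma split with no quoted-field handling):

raw = sc.textFile('s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv')
header = raw.first()                                   # first line holds the column names
rows = raw.filter(lambda line: line != header) \
          .map(lambda line: line.split(','))           # naive split; ignores quoted fields
df = sqlContext.createDataFrame(rows, header.split(','))
df.write.save('/my/local/folder/input_data', format='parquet')

Note that if the timeout comes from the S3 connection itself, reading the file with textFile will likely hit the same problem.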