PySpark: reading multiple CSV files in parallel into multiple dataframes

Asked: 2018-06-05 22:33:19

Tags: apache-spark pyspark

I have multiple CSV files in an S3 bucket. I know I can read them all at once by passing a wildcard to sparkSession.read:

 sparkSession.read.csv(path + "/*")

But that produces a single DataFrame. I need to process each file separately, yet in parallel. I followed the approach described at http://michaelryanbell.com/processing-whole-files-spark-s3.html:

    # Build one DataFrame per S3 key (buildS3Path is a helper from the linked post)
    def fetchData(key):
        return spark.read.option("header", "true").option("inferSchema", "true").csv(buildS3Path(bucket) + key['Key'])

    # s3C is a boto3 S3 client: list the CSV objects, distribute the key list
    # as an RDD, then try to read and process each file on the executors
    files = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
    files = sc.parallelize(files)
    files.map(fetchData).foreach(processData)

This gives me the following error:

    PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

How can I read multiple CSVs in parallel into separate dataframes for processing?
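
One workaround I am considering is to keep every spark.read call on the driver, where SparkSession is allowed, and get the concurrency from a thread pool instead of an RDD. A minimal sketch, assuming the same bucket, subfolder, s3C, and buildS3Path as above, and that processData accepts a DataFrame:

    from concurrent.futures import ThreadPoolExecutor

    def fetchAndProcess(key):
        # spark.read executes on the driver here, so SPARK-5063 does not apply;
        # the CSV scan itself is still distributed across the cluster
        df = spark.read.option("header", "true") \
                       .option("inferSchema", "true") \
                       .csv(buildS3Path(bucket) + key['Key'])
        return processData(df)

    keys = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
    # max_workers is an arbitrary choice; tune it to the cluster's capacity
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetchAndProcess, keys))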

0 Answers:

No answers