I have multiple CSV files in an S3 bucket. I know I can read them all at once with sparkSession.read and a wildcard:
sparkSession.read.csv(path + "/*")
But that produces a single DataFrame. I need to process each file individually, but in parallel. I followed the approach described here: http://michaelryanbell.com/processing-whole-files-spark-s3.html.
def fetchData(key):
    return spark.read.option("header", "true").option("inferSchema", "true").csv(buildS3Path(bucket) + key['Key'])
files = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
files = sc.parallelize(files)
files.map(fetchData).foreach(processData)
I get the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
How can I read multiple CSVs in parallel into separate DataFrames for processing?
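Is something like the following the right direction? It is a rough, untested sketch: since the SparkSession cannot be referenced from code running on workers (SPARK-5063), it keeps every spark.read call on the driver and uses a driver-side thread pool for concurrency, relying on Spark's scheduler being safe to call from multiple threads. It reuses buildS3Path, bucket, subfolder, s3C, and processData from the snippets above; fetchAndProcess and max_workers=8 are names and values I made up.

from concurrent.futures import ThreadPoolExecutor

def fetchAndProcess(key):
    # spark.read runs on the driver here; only the resulting DataFrame's
    # work is distributed across the cluster.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(buildS3Path(bucket) + key['Key']))
    processData(df)

keys = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']

# One Spark job per file, submitted concurrently from driver threads,
# so the SparkSession is never serialized to the workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fetchAndProcess, keys))

Or is there a more idiomatic way to get one DataFrame per file without serializing the session?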