I have multiple CSV files in an S3 bucket. I know I can read them all at once with sparkSession.read and a wildcard:
sparkSession.read.csv(path + "/*")
But that produces a single DataFrame. I need to process each file individually, but in parallel. I followed the approach described here: http://michaelryanbell.com/processing-whole-files-spark-s3.html.
def fetchData(key):
    return spark.read.option("header", "true").option("inferSchema", "true").csv(buildS3Path(bucket) + key['Key'])
files = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
files = sc.parallelize(files)
files.map(fetchData).foreach(processData)
I get the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
How can I read multiple CSVs in parallel into separate DataFrames for processing?
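Is something like the following the right direction? It is a rough, untested sketch: since the SparkSession cannot be referenced from code running on workers (SPARK-5063), it keeps every spark.read call on the driver and uses a driver-side thread pool for concurrency, relying on Spark's scheduler being safe to call from multiple threads. It reuses buildS3Path, bucket, subfolder, s3C, and processData from the snippets above; fetchAndProcess and max_workers=8 are names and values I made up.

from concurrent.futures import ThreadPoolExecutor

def fetchAndProcess(key):
    # spark.read runs on the driver here; only the resulting DataFrame's
    # work is distributed across the cluster.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(buildS3Path(bucket) + key['Key']))
    processData(df)

keys = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']

# One Spark job per file, submitted concurrently from driver threads,
# so the SparkSession is never serialized to the workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fetchAndProcess, keys))

Or is there a more idiomatic way to get one DataFrame per file without serializing the session?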