I am trying to read S3 parquet partitioned files (mocked with localstack) using PySpark (2.4) with hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The files are partitioned as event_year=YYYY/event_month=MM/event_day=DD, so I am using the basePath option.
paths= ['s3://ubaevents/events/org_pk=2/event_year=2018/event_month=11/','s3://ubaevents/events/org_pk=2/event_year=2018/event_month=12/']
base_path = 's3://ubaevents/events/'
df = spark.read.option("basePath", base_path).parquet(*paths)
This fails with:

df = spark.read.options(basePath=base_path).parquet(*paths)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amgonen/PycharmProjects/cyber-intel/venv/lib/python2.7/site-packages/pyspark/sql/readwriter.py", line 316, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/Users/amgonen/PycharmProjects/cyber-intel/venv/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/Users/amgonen/PycharmProjects/cyber-intel/venv/lib/python2.7/site-packages/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Option 'basePath' must be a directory"
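For reference, a minimal sketch of how a Spark session can be wired against a localstack S3 mock for this kind of read (the endpoint URL, port, dummy credentials, and the fs.s3.impl mapping below are placeholder assumptions, not my exact configuration):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("localstack-parquet")
    # Placeholder localstack endpoint and dummy credentials (assumptions).
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:4572")
    .config("spark.hadoop.fs.s3a.access.key", "test")
    .config("spark.hadoop.fs.s3a.secret.key", "test")
    # Map the s3:// scheme onto the S3A connector from hadoop-aws-2.7.3 so
    # paths like s3://ubaevents/... resolve through S3A (assumption).
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

base_path = "s3://ubaevents/events/"
paths = [
    "s3://ubaevents/events/org_pk=2/event_year=2018/event_month=11/",
    "s3://ubaevents/events/org_pk=2/event_year=2018/event_month=12/",
]
# basePath tells partition discovery where the partition columns
# (org_pk, event_year, event_month, event_day) begin.
df = spark.read.option("basePath", base_path).parquet(*paths)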