I'm using Spark 2.1 on EMR, and my files are stored by date:
s3://test/2016/07/01/file.gz
s3://test/2016/07/02/file.gz
...
...
s3://test/2017/05/15/file.gz
I only want to read the last month of data. I tried these two solutions, but they don't fit my need:
How to read multiple gzipped files from S3 into a single RDD
pyspark select subset of files using regex/glob from s3
Here is my script:
import datetime
from datetime import timedelta

from_dt = '2017/01/01'
to_dt = '2017/01/31'
d1 = datetime.datetime.strptime(from_dt, '%Y/%m/%d')  # start date
d2 = datetime.datetime.strptime(to_dt, '%Y/%m/%d')    # end date
delta = d2 - d1  # timedelta

# build the list of dates between d1 and d2, formatted as YYYY/MM/DD
date_range = []
for i in range(delta.days + 1):
    a = (d1 + timedelta(days=i)).strftime('%Y/%m/%d')
    date_range.append(a)

# turn the Python list repr into a {a, b, c} glob string
d = str(date_range).replace('[', '{').replace(']', '}').replace('\'', "")
print d
'{2017/01/01, 2017/01/02, 2017/01/03, 2017/01/04, 2017/01/05, 2017/01/06, 2017/01/07, 2017/01/08, 2017/01/09, 2017/01/10, 2017/01/11, 2017/01/12, 2017/01/13, 2017/01/14, 2017/01/15, 2017/01/16, 2017/01/17, 2017/01/18, 2017/01/19, 2017/01/20, 2017/01/21, 2017/01/22, 2017/01/23, 2017/01/24, 2017/01/25, 2017/01/26, 2017/01/27, 2017/01/28, 2017/01/29, 2017/01/30, 2017/01/31}'
DF1 = spark.read.csv("s3://test/"+d+"/*", sep='|', header='true')
DF1.count()
output : 7000
When I write the path manually, I don't get the same result:
DF2 = spark.read.csv("s3://test/2017/01/*/*", sep='|', header='true')
DF2.count()
output : 230000
Answer 0 (score: 0)
I found the error: the date range must be a single string with no whitespace between the dates.
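In other words, building the glob from the Python list's str() representation inserts ", " between the dates, and the leading spaces end up inside the path components, so most days never match. A minimal sketch of the corrected construction, assuming the bucket and separator from the question and using a plain join instead of str():

d = "{" + ",".join(date_range) + "}"
# '{2017/01/01,2017/01/02,...,2017/01/31}'  -- no spaces after the commas

DF1 = spark.read.csv("s3://test/" + d + "/*", sep='|', header='true')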