Suppose I have a DataFrame named transactions with the following integer columns: year, month, day, timestamp, transaction_id.
In [1]: transactions = ctx.createDataFrame(
   ...:     [(2017, 12, 1, 10000, 1),
   ...:      (2017, 12, 2, 10001, 2),
   ...:      (2017, 12, 3, 10003, 3),
   ...:      (2017, 12, 4, 10004, 4),
   ...:      (2017, 12, 5, 10005, 5),
   ...:      (2017, 12, 6, 10006, 6)],
   ...:     ('year', 'month', 'day', 'timestamp', 'transaction_id'))
In [2]: transactions.show()
+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  1|    10000|             1|
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
|2017|   12|  5|    10005|             5|
|2017|   12|  6|    10006|             6|
+----+-----+---+---------+--------------+
I want to define a function filter_date_range that returns a DataFrame consisting of the transaction rows falling within some date range:
>>> filter_date_range(
...     df=transactions,
...     start_date=datetime.date(2017, 12, 2),
...     end_date=datetime.date(2017, 12, 4)).show()
+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
+----+-----+---+---------+--------------+
Assuming the data is stored in Hive partitions, partitioned by year, month, and day, what is the most efficient way to execute a filter like this one, which involves date arithmetic? I am looking for a way to do this in a purely DataFrame-ic fashion, without resorting to transactions.rdd, so that Spark can infer that only a subset of the partitions actually needs to be read.
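For reference, a naive pure-DataFrame version can be written with pyspark.sql.functions (a sketch; it returns the right rows, but whether Spark turns this derived-column predicate into a partition filter depends on the Spark version):

import datetime
from pyspark.sql import functions as F

def filter_date_range(df, start_date, end_date):
    # Rebuild a yyyy-MM-dd date from the integer year/month/day columns;
    # format_string zero-pads month and day so to_date can parse the result.
    date_col = F.to_date(
        F.format_string("%04d-%02d-%02d", df["year"], df["month"], df["day"]))
    # between() is inclusive on both ends, matching the example above.
    return df.where(date_col.between(start_date.isoformat(),
                                     end_date.isoformat()))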
Answer 0 (score: 1):
If the data is partitioned like this:
.
├── _SUCCESS
└── year=2017
└── month=12
├── day=1
│ └── part-0...parquet
├── day=2
│ └── part-0...parquet
├── day=3
│ └── part-0...parquet
├── day=4
│ └── part-0...parquet
├── day=5
│ └── part-0...parquet
└── day=6
└── part-0...parquet
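Such a layout is what you get, for instance, from writing the frame out partitioned by those columns (a one-line sketch; base_path stands in for the table's root directory):

transactions.write.partitionBy("year", "month", "day").parquet(base_path)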
Given that layout, you can generate a list of directories to load:
import datetime

start_date = datetime.date(2017, 12, 2)
end_date = datetime.date(2017, 12, 4)

# One directory per day in the inclusive [start_date, end_date] range.
n = (end_date - start_date).days + 1

base_path = ...
paths = [
    "{}/year={}/month={}/day={}".format(base_path, d.year, d.month, d.day)
    for d in [start_date + datetime.timedelta(days=i) for i in range(n)]
]
spark.read.option("basePath", base_path).load(paths).explain()
# == Parsed Logical Plan ==
# Relation[timestamp#47L,transaction_id#48L,year#49,month#50,day#51] parquet
#
# == Analyzed Logical Plan ==
# timestamp: bigint, transaction_id: bigint, year: int, month: int, day: int
# Relation[timestamp#47L,transaction_id#48L,year#49,month#50,day#51] parquet
#
# == Optimized Logical Plan ==
# Relation[timestamp#47L,transaction_id#48L,year#49,month#50,day#51] parquet
#
# == Physical Plan ==
# *FileScan parquet [timestamp#47L,transaction_id#48L,year#49,month#50,day#51] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/transactions/year=2017/month=12/day=2, file:/user/hiv..., PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<timestamp:bigint,transaction_id:bigint>
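PartitionCount: 3 in the physical plan confirms that only the three day= directories in the range are scanned. Packaged as the filter_date_range the question asks for, this might look like the sketch below (note that it takes the table's base_path and the SparkSession rather than an already-loaded DataFrame, since the pruning happens at read time):

import datetime

def filter_date_range(spark, base_path, start_date, end_date):
    # Enumerate the partition directories covering the inclusive range
    # and read only those; basePath keeps year/month/day as columns.
    n = (end_date - start_date).days + 1
    paths = [
        "{}/year={}/month={}/day={}".format(base_path, d.year, d.month, d.day)
        for d in [start_date + datetime.timedelta(days=i) for i in range(n)]
    ]
    return spark.read.option("basePath", base_path).load(paths)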