Filtering a DataFrame by date range

Time: 2018-01-29 20:58:23

Tags: python apache-spark pyspark

Suppose I have a DataFrame named transactions with the following integer columns: year, month, day, timestamp, transaction_id.

In [1]: transactions = ctx.createDataFrame([(2017, 12, 1, 10000, 1), (2017, 12, 2, 10001, 2), (2017, 12, 3, 10003, 3), (2017, 12, 4, 10004, 4), (2017, 12, 5, 10005, 5), (2017, 12, 6, 10006, 6)],('year', 'month', 'day', 'timestamp', 'transaction_id'))

In [2]: transactions.show()
+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  1|    10000|             1|
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
|2017|   12|  5|    10005|             5|
|2017|   12|  6|    10006|             6|
+----+-----+---+---------+--------------+

I want to define a function filter_date_range that returns a DataFrame consisting of the transaction rows that fall within a given date range.

>>> filter_date_range(  
        df = transactions, 
        start_date = datetime.date(2017, 12, 2), 
        end_date = datetime.date(2017, 12, 4)).show()

+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
+----+-----+---+---------+--------------+
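
A naive implementation I can write simply assembles a date from the year/month/day columns and filters on it. This is only a sketch using the column names above; I don't know whether Spark can turn it into a partition filter:

    import datetime
    import pyspark.sql.functions as F

    def filter_date_range(df, start_date, end_date):
        # Assemble a real date from the integer partition columns and filter on it.
        # Whether Spark can translate this into a partition filter is the open question.
        date_col = F.to_date(
            F.format_string("%04d-%02d-%02d", "year", "month", "day"))
        return df.where(date_col.between(F.lit(start_date), F.lit(end_date)))

On the example above this returns transactions 2 through 4, but I suspect it still scans every partition.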

Suppose the data is stored as Hive partitions, partitioned by year, month, and day. What is the most efficient way to execute a filter that involves this kind of date arithmetic? I'm looking for a way to do this purely with the DataFrame API, without dropping down to transactions.rdd, so that Spark can work out that only a subset of the partitions actually needs to be read.

1 answer:

Answer 0 (score: 1)

Suppose the data is partitioned on disk like this:

.
├── _SUCCESS
└── year=2017
    └── month=12
        ├── day=1
        │   └── part-0...parquet
        ├── day=2
        │   └── part-0...parquet
        ├── day=3
        │   └── part-0...parquet
        ├── day=4
        │   └── part-0...parquet
        ├── day=5
        │   └── part-0...parquet
        └── day=6
            └── part-0...parquet
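
For reference, this is the sort of layout you get when the DataFrame is written with partitionBy. A sketch of such a write, using the warehouse location that appears in the query plan further down:

    # Sketch only: writing the example DataFrame with Hive-style partitioning
    # produces a directory tree like the one above.
    (transactions
        .write
        .partitionBy("year", "month", "day")
        .parquet("/user/hive/warehouse/transactions"))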

Given that layout, you can generate the list of directories to load:

import datetime

start_date = datetime.date(2017, 12, 2)
end_date = datetime.date(2017, 12, 4)
n = (end_date - start_date).days + 1

base_path = ...

# One Hive-style partition directory per day in the requested range.
paths = [
    "{}/year={}/month={}/day={}".format(base_path, d.year, d.month, d.day)
    for d in [start_date + datetime.timedelta(days=i) for i in range(n)]
]

spark.read.option("basePath", base_path).load(paths).explain(True)

# == Parsed Logical Plan ==
# Relation[timestamp#47L,transaction_id#48L,year#49,month#50,day#51] parquet
# 
# == Analyzed Logical Plan ==
# timestamp: bigint, transaction_id: bigint, year: int, month: int, day: int
# Relation[timestamp#47L,transaction_id#48L,year#49,month#50,day#51] parquet
# 
# == Optimized Logical Plan ==
# Relation[timestamp#47L,transaction_id#48L,year#49,month#50,day#51] parquet
# 
# == Physical Plan ==
# *FileScan parquet [timestamp#47L,transaction_id#48L,year#49,month#50,day#51] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/transactions/year=2017/month=12/day=2, file:/user/hiv..., PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<timestamp:bigint,transaction_id:bigint>
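
The PartitionCount: 3 in the physical plan shows that only the three requested day directories are read. Wrapped into the filter_date_range asked for in the question, the approach might look like the following sketch; base_path is left as a parameter, and the caller is assumed to pass only days that actually exist:

    import datetime

    def filter_date_range(spark, base_path, start_date, end_date):
        # Enumerate one Hive-style partition directory per day in the range
        # and load only those directories; Spark never touches the rest.
        n = (end_date - start_date).days + 1
        paths = [
            "{}/year={}/month={}/day={}".format(base_path, d.year, d.month, d.day)
            for d in (start_date + datetime.timedelta(days=i) for i in range(n))
        ]
        # Note: load() fails for paths that do not exist, so days with no
        # data would have to be filtered out of the list first.
        return spark.read.option("basePath", base_path).load(paths)

    # Example usage (path is a placeholder matching the plan above):
    # filter_date_range(spark, "/user/hive/warehouse/transactions",
    #                   datetime.date(2017, 12, 2),
    #                   datetime.date(2017, 12, 4)).show()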
