Getting the average price over time intervals of a large dataset

Time: 2017-07-01 05:17:20

Tags: python pyspark pyspark-sql

I have a csv file with a structure similar to the following.

INDEX,SYMBOL,DATETIMETS,PRICE,SIZE
0,A,2002-12-02 9:30:20,19.75,30200
1,A,2002-12-02 9:30:22,19.75,100
2,A,2002-12-02 9:30:22,19.75,300
3,A,2002-12-02 9:30:22,19.75,100
4,A,2002-12-02 9:30:23,19.75,100
5,A,2002-12-02 9:30:23,19.75,100
6,A,2002-12-02 9:30:23,19.75,100
7,A,2002-12-02 9:30:23,19.75,100
.......
.......

There are over a million rows spanning several years. I have loaded this csv file into a Spark DataFrame (pyspark). What is the fastest way to get the average price over 5-minute intervals?
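For reference, the setup implied by the query below looks roughly like this; the file name and the existing SparkContext (sc) are assumed:

from pyspark.sql import SQLContext

# Sketch only: load the csv and register it as the table "DF" queried below.
# "prices.csv" is an illustrative file name.
sqlContext = SQLContext(sc)
df = sqlContext.read.csv("prices.csv", header=True, inferSchema=True)
df.registerTempTable("DF")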

What I am currently doing is looping over the whole time range and querying each 5-minute interval, e.g.:

filteredSqlString = ("SELECT PRICE FROM DF WHERE DATETIMETS >= '" + str(sdt) + "'"
                     + " AND DATETIMETS < '" + str(idt) + "'")
filtered_df = sqlContext.sql(filteredSqlString)
MEAN_PRICE = filtered_df.select([mean("PRICE")]).first()[0]

and running this in a loop, incrementing the start and end datetimes:

sdt = idt
idt = sdt + timedelta(minutes=5)

This approach takes forever. Is there a faster way to achieve this?

1 Answer:

Answer 0 (score: 1)

I think this should be a better solution.

Given some input:

from pyspark.sql.types import (StructType, StructField,
                               IntegerType, StringType, DoubleType)

schema = StructType([
    StructField("INDEX", IntegerType(), True),
    StructField("SYMBOL", StringType(), True),
    StructField("DATETIMETS", StringType(), True),
    StructField("PRICE", DoubleType(), True),
    StructField("SIZE", IntegerType(), True),
])

df = spark\
    .createDataFrame(
        data=[(0,'A','2002-12-02 9:30:20',19.75,30200),
              (1,'A','2002-12-02 9:31:20',19.75,30200),
              (2,'A','2002-12-02 9:35:20',19.75,30200),
              (3,'A','2002-12-02 9:36:20',1.0,30200),
              (4,'A','2002-12-02 9:41:20',20.0,30200),
              (5,'A','2002-12-02 9:42:20',40.0,30200),
              (6,'A','2003-12-02 11:28:20',19.75,30200),
              (7,'A','2003-12-02 11:31:20',19.75,30200),
              (8,'A','2003-12-02 12:35:20',19.75,30200),
              (9,'A','2004-12-02 10:36:20',1.0,30200),
              (10,'A','2006-12-02 22:41:20',20.0,30200),
              (11,'A','2006-12-02 22:42:20',40.0,30200)],
        schema=schema)
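On the real csv, the same schema can be passed straight to the reader so Spark does not need an extra pass over the million-plus rows to infer types; the path below is only an illustration:

# Illustrative path; reusing the explicit schema skips schema inference.
real_df = spark.read.csv("prices.csv", schema=schema, header=True)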

Let's create the intervals of interest:

# 5-minute boundaries within an hour: 0, 5, ..., 60.
intervals = list(range(0, 61, 5))
print(intervals)

which gives:

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60]

Then we need some UDFs for the grouping:

from pyspark.sql.functions import mean, udf

# Despite the name, this extracts the date part (first 10 characters).
u_get_year = udf(lambda col: col[:10], StringType())
# Hour component of the time, e.g. '9' from '2002-12-02 9:30:20'.
u_get_hour = udf(lambda col: col.strip().split(" ")[1].split(':')[0], StringType())

def get_interval(col):
    # The minute of the timestamp decides the 5-minute bucket.
    curr = int(col.strip().split(" ")[1].split(':')[1])

    for idx in range(len(intervals) - 1):
        if intervals[idx] <= curr < intervals[idx + 1]:
            return "{}-{}".format(intervals[idx], intervals[idx + 1])

    return ""

u_get_interval = udf(get_interval, StringType())
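As a quick sanity check, the plain get_interval helper can be exercised before it is wrapped in a UDF (the sample timestamps are illustrative):

assert get_interval('2002-12-02 9:31:20') == '30-35'
assert get_interval('2002-12-02 9:42:20') == '40-45'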

Finally, let's perform the operations:

df2 = df.withColumn('DATE',u_get_year('DATETIMETS'))\
        .withColumn('HOUR', u_get_hour('DATETIMETS'))\
        .withColumn('INTERVAL', u_get_interval('DATETIMETS'))\
        .drop('DATETIMETS')

df2.groupBy('DATE', 'HOUR', 'INTERVAL').agg(mean('PRICE'))\
        .orderBy('DATE', 'HOUR', 'INTERVAL').show()

Output:

+----------+----+--------+----------+
|DATE      |HOUR|INTERVAL|avg(PRICE)|
+----------+----+--------+----------+
|2002-12-02|9   |30-35   |19.75     |
|2002-12-02|9   |35-40   |10.375    |
|2002-12-02|9   |40-45   |30.0      |
|2003-12-02|11  |25-30   |19.75     |
|2003-12-02|11  |30-35   |19.75     |
|2003-12-02|12  |35-40   |19.75     |
|2004-12-02|10  |35-40   |1.0       |
|2006-12-02|22  |40-45   |30.0      |
+----------+----+--------+----------+
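As a side note, Spark also ships a built-in window function that buckets a timestamp column into fixed intervals without any Python UDFs, which is usually faster on large data. A rough sketch, assuming DATETIMETS parses with the yyyy-MM-dd H:mm:ss pattern (single-digit hours):

from pyspark.sql.functions import mean, unix_timestamp, window

# Parse the string timestamp and group into tumbling 5-minute windows;
# this stays entirely in native Spark expressions.
ts = unix_timestamp('DATETIMETS', 'yyyy-MM-dd H:mm:ss').cast('timestamp')
df.withColumn('TS', ts)\
  .groupBy(window('TS', '5 minutes'))\
  .agg(mean('PRICE').alias('avg_price'))\
  .orderBy('window')\
  .show(truncate=False)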