I have a CSV file with a structure similar to the following.
INDEX,SYMBOL,DATETIMETS,PRICE,SIZE
0,A,2002-12-02 9:30:20,19.75,30200
1,A,2002-12-02 9:30:22,19.75,100
2,A,2002-12-02 9:30:22,19.75,300
3,A,2002-12-02 9:30:22,19.75,100
4,A,2002-12-02 9:30:23,19.75,100
5,A,2002-12-02 9:30:23,19.75,100
6,A,2002-12-02 9:30:23,19.75,100
7,A,2002-12-02 9:30:23,19.75,100
.......
.......
There are over a million rows spanning several years. I have loaded this CSV file into a Spark DataFrame (PySpark). What is the fastest way to get the average price in 5-minute intervals?
What I am currently doing is looping over the whole dataset and querying one 5-minute interval at a time, e.g.
filteredSqlString = ("SELECT PRICE FROM DF WHERE DATETIMETS >= '" + str(sdt) + "'"
                     + " AND DATETIMETS < '" + str(idt) + "'")
filtered_df = sqlContext.sql(filteredSqlString)
MEAN_PRICE = filtered_df.select([mean("PRICE")]).first()[0]
and running this in a loop, incrementing the start and end datetimes:
sdt = idt
idt = sdt + timedelta(minutes=5)
This approach takes forever. Is there a faster way to achieve this?
Answer 0 (score: 1)
I think this should be a better solution.
Given some input:
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, StringType, DoubleType)
from pyspark.sql.functions import udf, mean

schema = StructType([
    StructField("INDEX", IntegerType(), True),
    StructField("SYMBOL", StringType(), True),
    StructField("DATETIMETS", StringType(), True),
    StructField("PRICE", DoubleType(), True),
    StructField("SIZE", IntegerType(), True),
])
df = spark\
    .createDataFrame(
        data=[(0, 'A', '2002-12-02 9:30:20', 19.75, 30200),
              (1, 'A', '2002-12-02 9:31:20', 19.75, 30200),
              (2, 'A', '2002-12-02 9:35:20', 19.75, 30200),
              (3, 'A', '2002-12-02 9:36:20', 1.0, 30200),
              (4, 'A', '2002-12-02 9:41:20', 20.0, 30200),
              (5, 'A', '2002-12-02 9:42:20', 40.0, 30200),
              (6, 'A', '2003-12-02 11:28:20', 19.75, 30200),
              (7, 'A', '2003-12-02 11:31:20', 19.75, 30200),
              (8, 'A', '2003-12-02 12:35:20', 19.75, 30200),
              (9, 'A', '2004-12-02 10:36:20', 1.0, 30200),
              (10, 'A', '2006-12-02 22:41:20', 20.0, 30200),
              (11, 'A', '2006-12-02 22:42:20', 40.0, 30200)],
        schema=schema)
Let's create the intervals of interest:
intervals = []
for i in range(0, 61, 5):
    intervals.append(i)
print(intervals)
which gives:
[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60]
Then we need some UDFs for the grouping:
u_get_year = udf(lambda col: col[:10], StringType())
u_get_hour = udf(lambda col: col.strip().split(" ")[1].split(':')[0], StringType())

def get_interval(col):
    # minute of the timestamp, e.g. "2002-12-02 9:30:20" -> 30
    curr = int(col.strip().split(" ")[1].split(':')[1])
    # stop at len - 1 so intervals[idx + 1] never goes out of range
    for idx in range(len(intervals) - 1):
        if intervals[idx] <= curr < intervals[idx + 1]:
            return "{}-{}".format(intervals[idx], intervals[idx + 1])
    return ""

u_get_interval = udf(get_interval, StringType())
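The bucketing logic can be sanity-checked in plain Python before wrapping it in a UDF (restated here so it runs outside Spark):

```python
# Plain-Python restatement of the interval bucketing, for a quick sanity check.
intervals = list(range(0, 61, 5))

def get_interval(col):
    # extract the minute from "YYYY-MM-DD H:MM:SS" and find its 5-minute bucket
    curr = int(col.strip().split(" ")[1].split(":")[1])
    for idx in range(len(intervals) - 1):
        if intervals[idx] <= curr < intervals[idx + 1]:
            return "{}-{}".format(intervals[idx], intervals[idx + 1])
    return ""

print(get_interval("2002-12-02 9:36:20"))  # 35-40
```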
Finally, let's perform the operation:
df2 = df.withColumn('DATE', u_get_year('DATETIMETS'))\
        .withColumn('HOUR', u_get_hour('DATETIMETS'))\
        .withColumn('INTERVAL', u_get_interval('DATETIMETS'))\
        .drop('DATETIMETS')

df2.groupBy('DATE', 'HOUR', 'INTERVAL').agg(mean('PRICE'))\
   .orderBy('DATE', 'HOUR', 'INTERVAL').show()
Output:
+----------+----+--------+----------+
|DATE |HOUR|INTERVAL|avg(PRICE)|
+----------+----+--------+----------+
|2002-12-02|9 |30-35 |19.75 |
|2002-12-02|9 |35-40 |10.375 |
|2002-12-02|9 |40-45 |30.0 |
|2003-12-02|11 |25-30 |19.75 |
|2003-12-02|11 |30-35 |19.75 |
|2003-12-02|12 |35-40 |19.75 |
|2004-12-02|10 |35-40 |1.0 |
|2006-12-02|22 |40-45 |30.0 |
+----------+----+--------+----------+
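A note on performance: Python UDFs force row-by-row serialization, so on a million-plus rows the built-in `pyspark.sql.functions.window` (available since Spark 2.0) is usually much faster, e.g. `df.withColumn('ts', to_timestamp('DATETIMETS')).groupBy(window('ts', '5 minutes')).agg(mean('PRICE'))`. The truncation it performs is roughly equivalent to this plain-Python sketch (`floor_5min` is a hypothetical helper name):

```python
from datetime import datetime

def floor_5min(ts_str):
    # Parse "YYYY-MM-DD H:MM:SS" and truncate to the start of its 5-minute window.
    ts = datetime.strptime(ts_str.strip(), "%Y-%m-%d %H:%M:%S")
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0)

print(floor_5min("2002-12-02 9:36:20"))  # 2002-12-02 09:35:00
```

Grouping by this truncated timestamp gives one bucket per 5-minute window without any UDF, which lets Spark keep the aggregation inside the JVM.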