I have a CSV with time series data:
timestamp, measure-name, value, type, quality
1503377580,x.x-2.A,0.5281250,Float,GOOD
1503377340,x.x-1.B,0.0000000,Float,GOOD
1503377400,x.x-1.B,0.0000000,Float,GOOD
measure-name should be my partition key, and I want to compute a moving average with PySpark. Here is my code (as an example) for computing the maximum:
def mysplit(line):
    ll = line.split(",")
    return (ll[1], float(ll[2]))

text_file.map(lambda line: mysplit(line)).reduceByKey(lambda a, b: max(a, b)).foreach(print)
However, for the average I want to respect the timestamp ordering.
How can I sort by the second column?
Answer 0 (score: 1):
You need to use a window function on a PySpark DataFrame.
First, you should convert your RDD to a DataFrame:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
# note: the schema list is an argument to createDataFrame, not to map
df = hc.createDataFrame(
    text_file.map(lambda l: l.split(',')),
    ['timestamp', 'measure_name', 'value', 'type', 'quality'])
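Note that splitting the text file this way gives you string columns, so you will likely need to cast timestamp and value to numeric types before aggregating. A minimal sketch (the cast targets are my assumption based on your sample data):

import pyspark.sql.functions as psf

# cast the string columns produced by the split to numeric types
# so that mean() aggregates numbers rather than strings
df = df.withColumn('timestamp', psf.col('timestamp').cast('bigint')) \
       .withColumn('value', psf.col('value').cast('double'))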
Or load it directly as a DataFrame:
Locally:
import pandas as pd
df = hc.createDataFrame(pd.read_csv(path_to_csv, sep=",", header=0))
Distributed:
df = hc.read.format("com.databricks.spark.csv").option("delimiter", ",").load(path_to_csv)
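Since your sample suggests the file has a header line, the spark-csv package can also pick up the column names and infer the column types for you; a sketch using that package's header and inferSchema options:

df = hc.read.format("com.databricks.spark.csv") \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(path_to_csv)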
Then use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.orderBy('timestamp')
df.withColumn('value_rol_mean', psf.mean('value').over(w)).show()
+----------+------------+--------+-----+-------+-------------------+
| timestamp|measure_name| value| type|quality| value_rol_mean|
+----------+------------+--------+-----+-------+-------------------+
|1503377340| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377400| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377580| x.x-2.A|0.528125|Float| GOOD|0.17604166666666665|
+----------+------------+--------+-----+-------+-------------------+
You can order by as many columns as you want in .orderBy.
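Since you said measure-name should be your partition key, you can also partition the window per measure, and add a row frame to get a true moving average over the last N rows instead of a running mean. A sketch, where the 3-row window and the value_mov_avg column name are my own assumptions:

# compute the mean within each measure, ordered by timestamp;
# rowsBetween(-2, 0) makes it a 3-row moving average
w = Window.partitionBy('measure_name').orderBy('timestamp').rowsBetween(-2, 0)
df.withColumn('value_mov_avg', psf.mean('value').over(w)).show()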