Spark: performing a time-series function ordered by the second field

Asked: 2017-08-23 11:20:07

Tags: apache-spark pyspark

I have a CSV containing a time series:

timestamp, measure-name, value, type, quality

1503377580,x.x-2.A,0.5281250,Float,GOOD
1503377340,x.x-1.B,0.0000000,Float,GOOD
1503377400,x.x-1.B,0.0000000,Float,GOOD

measure-name should be my partition key, and I want to compute a moving average with pyspark. Here is my code (as an example) for computing the maximum:

def mysplit(line):
    # split a CSV line and keep (measure-name, value)
    ll = line.split(",")
    return (ll[1], float(ll[2]))

# maximum value per measure-name
text_file.map(lambda line: mysplit(line)).reduceByKey(lambda a, b: max(a, b)).foreach(print)

However, for the average I want to respect the timestamp ordering.

How can I order by the second column?

1 Answer:

Answer 0 (score: 1)

You need to use a window function on a pyspark dataframe.

First, you should convert your RDD to a dataframe:

from pyspark.sql import HiveContext
hc = HiveContext(sc)
df = hc.createDataFrame(text_file.map(lambda l: l.split(',')), ['timestamp', 'measure-name', 'value', 'type', 'quality'])
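
Note that splitting the lines this way yields string columns. A minimal sketch of casting timestamp and value to numeric types before aggregating (using the column names assumed above):

from pyspark.sql import functions as F
# cast the string columns produced by split() to numeric types
df = (df
      .withColumn('timestamp', F.col('timestamp').cast('long'))
      .withColumn('value', F.col('value').cast('double')))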

Or load it directly as a dataframe:

  • locally:

    import pandas as pd
    df = hc.createDataFrame(pd.read_csv(path_to_csv, sep=",", header=0))
    
  • from hdfs:

    df = hc.read.format("com.databricks.spark.csv").option("delimiter", ",").load(path_to_csv)
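
In more recent Spark versions (2.0+), the built-in CSV reader can be used instead of the spark-csv package; a minimal sketch, assuming a SparkSession named spark:

df = spark.read.csv(path_to_csv, sep=',', header=True, inferSchema=True)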
    

Then use a window function:

from pyspark.sql import Window
import pyspark.sql.functions as psf

# with only an orderBy (no explicit frame), the mean is cumulative
# from the first row up to the current row
w = Window.orderBy('timestamp')
df.withColumn('value_rol_mean', psf.mean('value').over(w)).show()

    +----------+------------+--------+-----+-------+-------------------+
    | timestamp|measure_name|   value| type|quality|     value_rol_mean|
    +----------+------------+--------+-----+-------+-------------------+
    |1503377340|     x.x-1.B|     0.0|Float|   GOOD|                0.0|
    |1503377400|     x.x-1.B|     0.0|Float|   GOOD|                0.0|
    |1503377580|     x.x-2.A|0.528125|Float|   GOOD|0.17604166666666665|
    +----------+------------+--------+-----+-------+-------------------+
You can order by as many columns as you want in your .orderBy.
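
Since the question asks for measure-name as the partition key, a window partitioned on that column gives a per-measure moving average. A minimal sketch, assuming the dataframe built above and a trailing window of the current row plus the two preceding rows (the window size is an assumption):

from pyspark.sql import Window
import pyspark.sql.functions as psf

# one window per measure-name, ordered by timestamp,
# averaging the current row and the two rows before it
w = Window.partitionBy('measure-name').orderBy('timestamp').rowsBetween(-2, 0)
df.withColumn('value_rol_mean', psf.mean('value').over(w)).show()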