How to calculate the row-wise mean before and after a given index for each row in pyspark?

Time: 2018-04-11 09:25:45

Tags: pyspark spark-dataframe

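I have a DataFrame with several value columns and an index column. For each row, I need to compute the mean of the columns before that row's index and the mean of the columns after it.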

Here is my pandas code:

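for i in range(len(res.index)):
    # Split position stored in this row's 'index' column
    m = int(res['index'].iloc[i])
    n = len(res.columns[1:m])
    if n == 0:
        res.loc[res.index[i], 'mean'] = 0
    else:
        # Mean of the columns at positions 1..m-1 for this row
        res.loc[res.index[i], 'mean'] = int(res.iloc[i, 1:m].sum()) / n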

How can I do this in pyspark? Please help!

1 Answer:

Answer 0: (score: 0)

You can do the calculation with a UDF in pyspark. Here is an example:

from pyspark.sql import functions as F
from pyspark.sql import types as T
import numpy as np


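# Each row holds ten values followed by the split index.
# list(...) is needed because range objects cannot be concatenated with + in Python 3.
sample_data = sqlContext.createDataFrame([
    list(range(10)) + [4],
    list(range(50, 60)) + [2],
    list(range(9, 19)) + [4],
    list(range(19, 29)) + [3],
], ["col_" + str(i) for i in range(10)] + ["index"])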
sample_data.show()


+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    0|    1|    2|    3|    4|    5|    6|    7|    8|    9|    4|
|   50|   51|   52|   53|   54|   55|   56|   57|   58|   59|    2|
|    9|   10|   11|   12|   13|   14|   15|   16|   17|   18|    4|
|   19|   20|   21|   22|   23|   24|   25|   26|   27|   28|    3|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+


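def def_mn(data, index, mean="pre"):
    # "pre" takes the values before the split index, "post" the values from it onward.
    if mean == "pre":
        chunk = data[:index]
    elif mean == "post":
        chunk = data[index:]
    else:
        return None
    # Return 0.0 for an empty slice to avoid division by zero.
    return sum(chunk) / float(len(chunk)) if chunk else 0.0

# Declare the return type; F.udf defaults to StringType otherwise.
mn_udf = F.udf(def_mn, T.DoubleType())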

sample_data.withColumn(
    "index_pre_mean", 
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index")
).withColumn(
    "index_post_mean", 
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index", F.lit("post"))
).show()

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|index_pre_mean|index_post_mean|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|0    |1    |2    |3    |4    |5    |6    |7    |8    |9    |4    |1.5           |6.5            |
|50   |51   |52   |53   |54   |55   |56   |57   |58   |59   |2    |50.5          |55.5           |
|9    |10   |11   |12   |13   |14   |15   |16   |17   |18   |4    |10.5          |15.5           |
|19   |20   |21   |22   |23   |24   |25   |26   |27   |28   |3    |20.0          |25.0           |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
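
The slicing logic inside the UDF can be checked without a Spark session. The following is only a minimal sketch in plain Python: `split_means` is a hypothetical helper name (it is not part of pyspark), and returning 0.0 for an empty slice is an assumption borrowed from the `n == 0` branch of the pandas code in the question.

```python
def split_means(values, index):
    """Return (mean of values[:index], mean of values[index:]).

    An empty slice yields 0.0 (assumption, mirroring the question's n == 0 branch).
    """
    def mean(chunk):
        return sum(chunk) / float(len(chunk)) if chunk else 0.0

    return mean(values[:index]), mean(values[index:])


# Same rows as sample_data above: ten values plus the split index.
rows = [
    (list(range(10)), 4),
    (list(range(50, 60)), 2),
    (list(range(9, 19)), 4),
    (list(range(19, 29)), 3),
]

for values, index in rows:
    print(split_means(values, index))
```

The printed pairs should agree with the index_pre_mean and index_post_mean columns in the table above. Note that the value at the split position itself is counted in the "post" slice.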