PySpark: How to Append DataFrames in a For Loop

Asked: 2019-05-29 15:02:40

标签: apache-spark pyspark time-series user-defined-functions

I am performing a rolling median calculation on individual time-series DataFrames, and I then want to merge/append the results.

# Required imports (omitted in the original snippet)
import numpy as np
from pyspark.sql.functions import col, collect_list, udf
from pyspark.sql.types import FloatType
from pyspark.sql.window import Window

# UDF for rolling median
median_udf = udf(lambda x: float(np.median(x)), FloatType())

# The window spec `w` is not shown in the question; this is an assumed
# example: a trailing 7-row window per series, ordered by date
w = Window.partitionBy("ID").orderBy("date").rowsBetween(-6, 0)

series_list = ['0620', '5914']
SeriesAppend = []

for item in series_list:
    # Filter for the selected series
    series = test_df.where(col("ID").isin([item]))
    # Sort the time series
    series_sorted = series.sort(series.ID, series.date).persist()
    # Calculate the rolling median over the window
    series_sorted = series_sorted.withColumn("list", collect_list("metric").over(w)) \
        .withColumn("rolling_median", median_udf("list"))

    SeriesAppend.append(series_sorted)

SeriesAppend

[DataFrame[ntwrk_genre_cd: string, date: date, mkt_cd: string, syscode: string, ntwrk_cd: string, syscode_ntwrk: string, metric: double, list: array, rolling_median: float], DataFrame[ntwrk_genre_cd: string, date: date, mkt_cd: string, syscode: string, ntwrk_cd: string, syscode_ntwrk: string, metric: double, list: array, rolling_median: float]]

When I try .show():

Traceback (most recent call last):
AttributeError: 'list' object has no attribute 'show'

I realize this is telling me that the object is a Python list of two DataFrames (with identical schemas). How do I convert it into a single DataFrame?
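For reference, each element of the list is still a full Spark DataFrame, so .show() works on the individual elements (a quick sanity check, assuming the loop above has run):

# Each element of the Python list is a Spark DataFrame
SeriesAppend[0].show(5)
SeriesAppend[1].show(5)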

I know the following solution works for an explicit number of DataFrames, but I would like my for loop to be agnostic to the number of DataFrames:

from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1, df2, df3]
df = reduce(DataFrame.unionAll, dfs)

Is there a way to generalize this to non-explicit DataFrame names?

1 Answer:

Answer 0 (score: 1)

Thanks everyone! To summarize: the solution uses reduce with unionAll:

SeriesAppend = []

for item in series_list:
    # Filter for the selected series
    series = test_df.where(col("ID").isin([item]))
    # Sort the time series
    series_sorted = series.sort(series.ID, series.date).persist()
    # Calculate the rolling median over the window
    series_sorted = series_sorted.withColumn("list", collect_list("metric").over(w)) \
        .withColumn("rolling_median", median_udf("list"))

    SeriesAppend.append(series_sorted)

# Fold unionAll over the list, however many DataFrames it contains
df_series = reduce(DataFrame.unionAll, SeriesAppend)
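
One caveat worth noting: unionAll matches columns by position, so every DataFrame in the list must share the same column order. If that is not guaranteed, unionByName (available since Spark 2.3) matches columns by name instead; a minimal sketch:

# Alternative fold that matches columns by name rather than by position
df_series = reduce(lambda left, right: left.unionByName(right), SeriesAppend)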