How to apply many operations to different columns of a Spark dataframe and save them under new aliases

Time: 2019-07-08 15:30:59

Tags: python python-3.x apache-spark pyspark

I have the following Spark dataframe:

df = spark.createDataFrame([['2017-04-01', 'A', 1, 1],
                            ['2017-04-01', 'B', 2, 3],
                            ['2017-04-01', 'B', 3, 4],
                            ['2017-04-01', 'A', 5, 5]], schema=['pdate', 'url', 'weight', 'imp'])

I want to groupBy url, perform the following operations, and assign the results to new columns:

  • min of pdate, aliased as min_pdate
  • max of pdate, aliased as max_pdate
  • sum of imp, aliased as sum_imp
  • weighted mean of imp (weighted by weight), aliased as wmean_imp

Is there a clean way to do this in pyspark?

1 Answer:

Answer 0 (score: 2):

Just use agg to apply several aggregation functions to the grouped data returned by groupBy:

import pyspark.sql.functions as f

from pyspark.shell import spark

df = spark.createDataFrame([['2017-03-01', 'A', 1, 1],
                            ['2017-04-01', 'B', 2, 3],
                            ['2017-05-01', 'B', 3, 4],
                            ['2017-06-01', 'A', 5, 5]], schema=['pdate', 'url', 'weight', 'imp'])

df = df \
    .groupBy(f.col('url')) \
    .agg(f.min('pdate').alias('min_pdate'),
         f.max('pdate').alias('max_pdate'),
         f.sum('imp').alias('sum_imp'),
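         # weighted mean of imp, weighted by the weight column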
         (f.sum(f.col('imp') * f.col('weight')) / f.sum(f.col('weight'))).alias('wmean_imp'))
df.show()

Output:

+---+----------+----------+-------+-----------------+
|url| min_pdate| max_pdate|sum_imp|        wmean_imp|
+---+----------+----------+-------+-----------------+
|  B|2017-04-01|2017-05-01|      7|              3.6|
|  A|2017-03-01|2017-06-01|      6|4.333333333333333|
+---+----------+----------+-------+-----------------+
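
If the same set of aggregations has to be applied to many columns, the list of expressions can also be built programmatically and unpacked into agg. A minimal sketch, reusing the df defined above (the agg_cols list and the min/max choice are only for illustration):

import pyspark.sql.functions as f

# columns to aggregate the same way (illustrative choice)
agg_cols = ['weight', 'imp']

# one min and one max expression per column, each with its own alias
exprs = [f.min(c).alias('min_' + c) for c in agg_cols] + \
        [f.max(c).alias('max_' + c) for c in agg_cols]

df.groupBy('url').agg(*exprs).show()

Since agg accepts any number of column expressions, this pattern keeps a single groupBy while the list of operations and aliases grows.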