I have the following:
y <- 2
z <- 3
test0 <- function(x, var){
  y <- 1
  x + eval(substitute(var))
}
# oops, the value of y is the one defined in the body
test0(0, y)
#> [1] 1
test0(0, z)
#> [1] 3
# but it will work using eval.parent:
test1 <- function(x, var){
  y <- 1
  x + eval.parent(substitute(var))
}
test1(0, y)
#> [1] 2
test1(0, z)
#> [1] 3
# in some cases (better avoided), a quick and dirty alternative is something like:
test2 <- function(x, var){
  y <- 1
  # whatever code using y
  rm(y)
  x + eval(substitute(var))
}
test2(0, y)
#> [1] 2
test2(0, z)
#> [1] 3
I have the following Spark dataframe:

df = spark.createDataFrame([['2017-04-01', 'A', 1, 1],
                            ['2017-04-01', 'B', 2, 3],
                            ['2017-04-01', 'B', 3, 4],
                            ['2017-04-01', 'A', 5, 5]], schema=['pdate', 'url', 'weight', 'imp'])

I want to group by url, perform the following operations, and assign the results to new columns:

- min of pdate (aliased as min_pdate)
- max of pdate (aliased as max_pdate)
- sum of imp (aliased as sum_imp)
- weighted mean of imp, weighted by weight (aliased as wmean_imp)

Is there a concise way to do this with pyspark?
Answer (score: 2)
Just use the agg function to apply multiple aggregations to the groupBy result:
import pyspark.sql.functions as f
from pyspark.shell import spark
df = spark.createDataFrame([['2017-03-01', 'A', 1, 1],
                            ['2017-04-01', 'B', 2, 3],
                            ['2017-05-01', 'B', 3, 4],
                            ['2017-06-01', 'A', 5, 5]], schema=['pdate', 'url', 'weight', 'imp'])

df = df \
    .groupBy(f.col('url')) \
    .agg(f.min('pdate').alias('min_pdate'),
         f.max('pdate').alias('max_pdate'),
         f.sum('imp').alias('sum_imp'),
         # weighted mean of imp, using weight as the weights
         (f.sum(f.col('imp') * f.col('weight')) / f.sum(f.col('weight'))).alias('wmean_imp'))
df.show()
Output:
+---+----------+----------+-------+-----------------+
|url| min_pdate| max_pdate|sum_imp| wmean_imp|
+---+----------+----------+-------+-----------------+
| B|2017-04-01|2017-05-01| 7| 3.6|
| A|2017-03-01|2017-06-01| 6|4.333333333333333|
+---+----------+----------+-------+-----------------+
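
As a follow-up (not part of the original answer, just a sketch assuming the same column names pdate, url, weight and imp and the same aliases): when the list of aggregations grows, it can be convenient to build the expressions in a plain Python list and unpack them into agg with *, which produces the same result as above.

import pyspark.sql.functions as f
from pyspark.shell import spark

df = spark.createDataFrame([['2017-03-01', 'A', 1, 1],
                            ['2017-04-01', 'B', 2, 3],
                            ['2017-05-01', 'B', 3, 4],
                            ['2017-06-01', 'A', 5, 5]], schema=['pdate', 'url', 'weight', 'imp'])

# build the aggregation expressions once, then unpack them into agg()
aggs = [
    f.min('pdate').alias('min_pdate'),
    f.max('pdate').alias('max_pdate'),
    f.sum('imp').alias('sum_imp'),
    # weighted mean of imp, weighted by the weight column
    (f.sum(f.col('imp') * f.col('weight')) / f.sum(f.col('weight'))).alias('wmean_imp'),
]

df.groupBy('url').agg(*aggs).show()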