A useful feature of dplyr is the ability to quickly create calculated columns with mutate. One such calculation is a quantile, which I used to be able to do in sparklyr with the percentile function, but for some reason it no longer works. Here is a detailed example.

First, create a sample dataset:
require(dplyr)
require(sparklyr)
# sc is a connection to spark
my_df <- data.frame(col1 = sample(1:100,30)) %>% as_tibble()
my_df
# # A tibble: 30 x 1
# col1
# <int>
# 1 91
# 2 1
# 3 15
# 4 42
# 5 36
# 6 18
# 7 35
# 8 98
# 9 60
# 10 24
# # ... with 20 more rows
Now calculate the 90th percentile:
my_df %>% mutate(pct_90 = quantile(col1, .9))
# # A tibble: 30 x 2
# col1 pct_90
# <int> <dbl>
# 1 91 84.7
# 2 1 84.7
# 3 15 84.7
# 4 42 84.7
# 5 36 84.7
# 6 18 84.7
# 7 35 84.7
# 8 98 84.7
# 9 60 84.7
# 10 24 84.7
# # ... with 20 more rows
Now with Spark, copy the data frame to the cluster:
my_spark_df <- copy_to(sc, my_df, 'my_spark_df')
my_spark_df
# # Source: spark<my_spark_df> [?? x 1]
# col1
# * <int>
# 1 91
# 2 1
# 3 15
# 4 42
# 5 36
# 6 18
# 7 35
# 8 98
# 9 60
# 10 24
# # ... with more rows
Now calculate the 90th percentile:
my_spark_df %>% mutate(pct_90 = percentile(col1, .9))
Error: org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'my_spark_df.`col1`' is not an aggregate function. Wrap '(percentile(my_spark_df.`col1`, CAST(0.9BD AS DOUBLE), 1L) AS `pct_90`)' in windowing function(s) or wrap 'my_spark_df.`col1`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [col1#6964, percentile(col1#6964, cast(0.9 as double), 1, 0, 0) AS pct_90#7030]
+- SubqueryAlias `my_spark_df`
+- LogicalRDD [col1#6964], false
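
For what it's worth, the error says Spark treats percentile as an aggregate function, so it cannot appear as a plain per-row expression in mutate. A minimal workaround sketch (assuming sc is still a live connection and my_spark_df exists as created above): compute the aggregate with summarise, collect the scalar with pull, then splice it back into mutate with !!:

# Workaround sketch: percentile() is an aggregate in Spark SQL,
# so compute it in summarise() and attach the result as a constant column.
pct_90_value <- my_spark_df %>%
  summarise(pct_90 = percentile(col1, 0.9)) %>%
  pull(pct_90)  # collects the single aggregate value to the R session

my_spark_df %>%
  mutate(pct_90 = !!pct_90_value)  # !! splices the local value into the query

The error message itself also suggests wrapping the call in a windowing function, i.e. evaluating percentile(col1, 0.9) OVER () on the Spark SQL side, which should give the same result without a round trip to R.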