Computing order statistics (percentiles) with sparklyr

Date: 2019-09-11 13:33:51

Tags: r apache-spark dplyr sparklyr

A useful feature of dplyr is the ability to quickly create computed columns with `mutate`; one such computation is a quantile. I used to be able to do this against sparklyr with `percentile`, but for some reason it no longer works. Here is a detailed example.

First, create a sample dataset:


require(dplyr)
require(sparklyr)

# sc is a connection to spark

my_df <- data.frame(col1 = sample(1:100,30)) %>%  as_tibble()

my_df 

# # A tibble: 30 x 1
# col1
# <int>
# 1    91
# 2     1
# 3    15
# 4    42
# 5    36
# 6    18
# 7    35
# 8    98
# 9    60
# 10    24
# # ... with 20 more rows

Now compute the 90th percentile locally with dplyr — this works:

my_df %>%  mutate(pct_90 = quantile(col1, .9))

# # A tibble: 30 x 2
# col1 pct_90
# <int>  <dbl>
# 1    91   84.7
# 2     1   84.7
# 3    15   84.7
# 4    42   84.7
# 5    36   84.7
# 6    18   84.7
# 7    35   84.7
# 8    98   84.7
# 9    60   84.7
# 10    24   84.7
# # ... with 20 more rows

Now copy the data to Spark:

my_spark_df <- copy_to(sc, my_df, 'my_spark_df')

my_spark_df 


# # Source: spark<my_spark_df> [?? x 1]
# col1
# * <int>
# 1    91
# 2     1
# 3    15
# 4    42
# 5    36
# 6    18
# 7    35
# 8    98
# 9    60
# 10    24
# # ... with more rows

And compute the 90th percentile on the Spark DataFrame:

my_spark_df %>%  mutate(pct_90 = percentile(col1, .9))


Error: org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'my_spark_df.`col1`' is not an aggregate function. Wrap '(percentile(my_spark_df.`col1`, CAST(0.9BD AS DOUBLE), 1L) AS `pct_90`)' in windowing function(s) or wrap 'my_spark_df.`col1`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [col1#6964, percentile(col1#6964, cast(0.9 as double), 1, 0, 0) AS pct_90#7030]
+- SubqueryAlias `my_spark_df`
   +- LogicalRDD [col1#6964], false
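The error message itself points at two ways around this: either use `percentile` as a true aggregate (so Spark has an aggregation context), or wrap the call in a window function. A minimal sketch of both, assuming the same `sc` connection and the `my_spark_df` table registered above (not verified against a live cluster, and the `OVER ()` SQL form relies on Spark accepting `percentile` as a window aggregate):

```r
library(dplyr)
library(sparklyr)
library(DBI)

# Option 1: percentile() in an aggregation context. summarise() gives Spark
# a grouping expression sequence (empty group = whole table), so the
# aggregate is legal. Returns a single-row result.
pct <- my_spark_df %>%
  summarise(pct_90 = percentile(col1, 0.9))

# Option 2: the windowed form the error suggests, written directly in
# Spark SQL. OVER () makes percentile() a window aggregate computed over
# the whole table, so it can sit next to the per-row col1 values.
with_pct <- dbGetQuery(sc,
  "SELECT col1, percentile(col1, 0.9) OVER () AS pct_90 FROM my_spark_df")
```

Option 1 mirrors what `quantile` does conceptually (one summary value); option 2 reproduces the broadcast-to-every-row shape that the local `mutate(pct_90 = quantile(col1, .9))` call produced.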

0 Answers:

There are no answers.