Is it possible to use native R code or other R package functions with sparklyr?

Date: 2016-08-18 20:31:44

Tags: r sparkr sparklyr

I've reached the point where I can follow the example here (with only one slight modification: adding config=list() to the input arguments).

library(sparklyr)  # provides spark_connect() and the Spark backend for copy_to()
library(dplyr)
sc <- spark_connect(master = "yarn-client", config = list())
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_tbl %>% filter(dep_delay == 2)

Source:   query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

    year month   day dep_time dep_delay arr_time arr_delay carrier  tailnum     flight origin  dest air_time distance  hour minute
   <int> <int> <int>    <int>     <dbl>    <int>     <dbl>   <chr>    <chr>      <int>  <chr> <chr>    <dbl>    <dbl> <dbl>  <dbl>
1   2013     1     1      517         2      830        11    "UA" "N14228"       1545  "EWR" "IAH"      227     1400     5     17
2   2013     1     1      542         2      923        33    "AA" "N619AA"       1141  "JFK" "MIA"      160     1089     5     42
3   2013     1     1      702         2     1058        44    "B6" "N779JB"        671  "JFK" "LAX"      381     2475     7      2
4   2013     1     1      715         2      911        21    "UA" "N841UA"        544  "EWR" "ORD"      156      719     7     15
5   2013     1     1      752         2     1025        -4    "UA" "N511UA"        477  "LGA" "DEN"      249     1620     7     52
6   2013     1     1      917         2     1206        -5    "B6" "N568JB"         41  "JFK" "MCO"      145      944     9     17
7   2013     1     1      932         2     1219        -6    "VX" "N641VA"        251  "JFK" "LAS"      324     2248     9     32
8   2013     1     1     1028         2     1350        11    "UA" "N76508"       1004  "LGA" "IAH"      237     1416    10     28
9   2013     1     1     1042         2     1325        -1    "B6" "N529JB"         31  "JFK" "MCO"      142      944    10     42
10  2013     1     1     1231         2     1523        -6    "UA" "N402UA"        428  "EWR" "FLL"      156     1065    12     31
# ... with more rows

However, when I try to use other R functions inside the pipeline, as one normally can with dplyr, I run into problems:

flights_tbl %>% filter(dep_delay == 2 & grepl("A$", tailnum)) 
Source:   query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

Error: org.apache.spark.sql.AnalysisException: undefined function GREPL; line 4 pos 41
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:68)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
at scala.util.Try.getOrElse(Try.scala:77)
at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.

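Clearly grepl isn't supported: the filter expression is translated to Spark SQL, and Spark has no function called GREPL. My question is: is there any way to use base R or R package functions? If not, is it coming? Work along these lines seems to be under way in SparkR with dapply and gapply, but it would be great if this worked with sparklyr.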

1 answer:

Answer 0 (score: 2)

Just saw this issue for sparklyr. The short answer is "not yet". Expect a future release to add this functionality.
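For anyone landing here later, a rough sketch of two ways around the error, not taken from the linked issue. The first filters in Spark on what can be translated, then uses collect() to bring the smaller result into R, where grepl works as usual. The second assumes a later sparklyr release (0.6 or newer) that provides spark_apply(), and that R is installed on the worker nodes. The helper names keep_tail_a, filter_after_collect and filter_on_workers are made up for illustration.

```r
# Pure-R helper: keep rows whose tailnum ends in "A".
# Takes and returns an ordinary data frame, so it can be tried locally.
keep_tail_a <- function(df) {
  df[grepl("A$", df$tailnum), , drop = FALSE]
}

# Option 1 (hypothetical helper): let Spark do the part it can translate,
# then pull the reduced result into R and apply grepl there.
filter_after_collect <- function(flights_tbl) {
  delayed <- dplyr::filter(flights_tbl, dep_delay == 2)  # runs as Spark SQL
  keep_tail_a(dplyr::collect(delayed))                   # plain R from here on
}

# Option 2 (hypothetical helper): run the R function on the workers.
# Assumption: sparklyr >= 0.6 (spark_apply) and R available on every node.
filter_on_workers <- function(flights_tbl) {
  delayed <- dplyr::filter(flights_tbl, dep_delay == 2)
  sparklyr::spark_apply(delayed, keep_tail_a)  # keep_tail_a is called per partition
}
```

With the flights_tbl from the question, either filter_after_collect(flights_tbl) or filter_on_workers(flights_tbl) would be the call. The collect() route only makes sense when the pre-filtered data fits in the memory of the R session; the spark_apply() route keeps the data in the cluster but pays for starting R on each partition.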