I was able to follow the steps in the example here (with only one minor modification: adding config = list() to the input arguments).
sc <- spark_connect(master = "yarn-client", config=list())
library(dplyr)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_tbl %>% filter(dep_delay == 2)
Source: query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
<int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 2 830 11 "UA" "N14228" 1545 "EWR" "IAH" 227 1400 5 17
2 2013 1 1 542 2 923 33 "AA" "N619AA" 1141 "JFK" "MIA" 160 1089 5 42
3 2013 1 1 702 2 1058 44 "B6" "N779JB" 671 "JFK" "LAX" 381 2475 7 2
4 2013 1 1 715 2 911 21 "UA" "N841UA" 544 "EWR" "ORD" 156 719 7 15
5 2013 1 1 752 2 1025 -4 "UA" "N511UA" 477 "LGA" "DEN" 249 1620 7 52
6 2013 1 1 917 2 1206 -5 "B6" "N568JB" 41 "JFK" "MCO" 145 944 9 17
7 2013 1 1 932 2 1219 -6 "VX" "N641VA" 251 "JFK" "LAS" 324 2248 9 32
8 2013 1 1 1028 2 1350 11 "UA" "N76508" 1004 "LGA" "IAH" 237 1416 10 28
9 2013 1 1 1042 2 1325 -1 "B6" "N529JB" 31 "JFK" "MCO" 142 944 10 42
10 2013 1 1 1231 2 1523 -6 "UA" "N402UA" 428 "EWR" "FLL" 156 1065 12 31
# ... with more rows
However, when I try to use other R functions inside a dplyr pipeline, problems can arise:
flights_tbl %>% filter(dep_delay == 2 & grepl("A$", tailnum))
Source: query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
Error: org.apache.spark.sql.AnalysisException: undefined function GREPL; line 4 pos 41
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:68)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
at scala.util.Try.getOrElse(Try.scala:77)
at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.
Apparently grepl is not supported. My question is: is there a way to use base R or R package functions here? If not, is support coming? It seems work along these lines is happening with dapply and gapply in SparkR, but it would be nice if it worked with sparklyr as well.
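For reference, a possible workaround sketch (this assumes the connection sc and flights_tbl from the snippet above, and relies on sparklyr/dplyr passing functions it cannot translate through verbatim to Spark SQL): instead of grepl, which only exists in R, you can call a function Spark SQL itself provides, such as Hive's RLIKE, which does regex matching on the executors.

```r
library(sparklyr)
library(dplyr)

# rlike() is not an R function; dplyr leaves it untranslated, so it reaches
# Spark SQL as RLIKE(tailnum, 'A$') and is evaluated by Spark, not by R.
flights_tbl %>%
  filter(dep_delay == 2 & rlike(tailnum, "A$"))
```

The general point of the sketch: inside filter()/mutate() on a Spark tbl, only expressions that dplyr can translate to SQL (or that Spark SQL knows natively) will work; arbitrary R functions like grepl cannot run there because the data never comes back to the R process.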