I am using sparklyr to read a CSV file into Spark:
schema <- structType(structField("TransTime", "array<timestamp>", TRUE),
                     structField("TransDay", "Date", TRUE))
spark_read_csv(sc, filename, "path", infer_schema = FALSE, schema = schema)
But I get:
Error: could not find function "structType"
How can I specify column types with spark_read_csv?
Thanks in advance.
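Answer 0 (score: 7)
The structType function is not part of sparklyr; it comes from Spark's Scala API (SparkR exposes a function of the same name). In sparklyr, you specify data types by passing them as a named list in the "columns" argument. Suppose we have the following CSV (data.csv):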
name,birthdate,age,height
jader,1994-10-31,22,1.79
maria,1900-03-12,117,1.32
The call that reads this data is:
mycsv <- spark_read_csv(sc, "mydate",
                        path = "data.csv",
                        memory = TRUE,
                        infer_schema = FALSE,  # important: turn off schema inference
                        columns = list(
                          name = "character",
                          birthdate = "date",  # or "character"; dates need Hive date functions
                          age = "integer",
                          height = "double"))
# integer = "INTEGER"
# double = "REAL"
# character = "STRING"
# logical = "INTEGER"
# list = "BLOB"
# date = character = "STRING" # not sure
To manipulate date columns, you must use Hive date functions rather than R functions.
mycsv %>% mutate(birthyear = year(birthdate))
Reference: https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions
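Applied to the columns from the question, the same approach might look like the sketch below. It assumes that TransTime holds one timestamp per row (not an array) and that sparklyr accepts "timestamp" and "date" as type names in `columns`. The helper `read_trans_csv` and the table name "trans" are made up for illustration.

```r
# Column types for the question's CSV, as a named list of type names.
# Assumption: "timestamp" and "date" are accepted type names in sparklyr.
trans_columns <- list(
  TransTime = "timestamp",
  TransDay  = "date"
)

# Hypothetical helper: reads `path` into Spark using the types above.
# `sc` is an existing connection created with sparklyr::spark_connect().
read_trans_csv <- function(sc, path) {
  sparklyr::spark_read_csv(
    sc,
    name = "trans",        # illustrative table name
    path = path,
    infer_schema = FALSE,  # use `columns` instead of inferring types
    columns = trans_columns
  )
}
```

Calling read_trans_csv(sc, filename) should then return a reference to the Spark table. If a type name is rejected, reading that column as "character", as suggested above for dates, is a possible fallback.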
Answer 1 (score: 1)
An article on the official sparklyr website has an example of how to do this: http://spark.rstudio.com/example-s3.html#data_import