R - Assign all column types at once in a function

Date: 2018-01-18 18:02:53

Tags: r apache-spark sparklyr

If I want to assign the character type to all of my columns at once, using a function such as spark_read_csv from sparklyr, I have to do something like this:

flights <- spark_read_csv(sc,
  "flights_spark",
  path = "/path/flights.csv",
  memory = TRUE,
  columns = list(
    Year = "character", Month = "character", DayofMonth = "character",
    DayOfWeek = "character", DepTime = "character", CRSDepTime = "character",
    ArrTime = "character", CRSArrTime = "character", UniqueCarrier = "character",
    FlightNum = "character", TailNum = "character", ActualElapsedTime = "character",
    CRSElapsedTime = "character", AirTime = "character", ArrDelay = "character",
    DepDelay = "character", Origin = "character", Dest = "character",
    Distance = "character", TaxiIn = "character", TaxiOut = "character",
    Cancelled = "character", CancellationCode = "character", Diverted = "character",
    CarrierDelay = "character", WeatherDelay = "character", NASDelay = "character",
    SecurityDelay = "character", LateAircraftDelay = "character"),
  infer_schema = FALSE)

Is there a way to make this less painful, for example something like fread from data.table?

1 Answer:

Answer 0 (score: 0)

Since all fields are character and you have disabled schema inference, a simple list of names is all you need:

spark_read_csv(sc,
  "flights_spark", 
  path =  "/path/flights.csv", 
  columns = list("Year", "Month", ..., "LateAircraftDelay"),
  infer_schema = FALSE)

Although, without schema inference, you should be able to skip it completely without any significant performance penalty:

spark_read_csv(sc,
  "flights_spark", 
  path =  "/path/flights.csv", 
  infer_schema = FALSE)

In the general case (different types for different columns), a named list will do the trick:

names_ <- c("Year", "Month", ..., "LateAircraftDelay")
dtypes <- list("integer", "integer", ..., "string")

spark_read_csv(sc,
  "flights_spark", 
  path =  "/path/flights.csv", 
  columns = setNames(dtypes, names_),
  infer_schema = FALSE)
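When the file is very wide and every column should get the same type, the columns spec itself can also be built programmatically instead of typed by hand. A minimal base-R sketch (the three column names below are placeholders standing in for the real CSV header):

```r
# Placeholder column names; in practice these would come from the CSV header.
names_ <- c("Year", "Month", "DayofMonth")

# Repeat "character" once per column, then attach the names.
cols <- setNames(as.list(rep("character", length(names_))), names_)

str(cols)

# The resulting named list can then be passed straight to spark_read_csv:
# spark_read_csv(sc, "flights_spark", path = "/path/flights.csv",
#                columns = cols, infer_schema = FALSE)
```

This avoids writing out Year = "character", Month = "character", ... by hand, and the same pattern extends to mixed types by replacing the rep() call with a vector of per-column type names.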