Using sparklyr, if I want to assign the character type to all of my columns at once, I do something like this:
flights <- spark_read_csv(sc, "flights_spark",
path = "/path/flights.csv",
memory = TRUE,
columns = list(
Year = "character",
Month = "character",
DayofMonth = "character",
DayOfWeek = "character",
DepTime = "character",
CRSDepTime = "character",
ArrTime = "character",
CRSArrTime = "character",
UniqueCarrier = "character",
FlightNum = "character",
TailNum = "character",
ActualElapsedTime = "character",
CRSElapsedTime = "character",
AirTime = "character",
ArrDelay = "character",
DepDelay = "character",
Origin = "character",
Dest = "character",
Distance = "character",
TaxiIn = "character",
TaxiOut = "character",
Cancelled = "character",
CancellationCode = "character",
Diverted = "character",
CarrierDelay = "character",
WeatherDelay = "character",
NASDelay = "character",
SecurityDelay = "character",
LateAircraftDelay = "character"),
infer_schema = FALSE)
Is there any way to make this less painful, using some function, for example a shortcut like the one fread from data.table provides?
Answer 0 (score: 0)
Since all the fields are character and you disable schema inference, a simple list of names will do:
spark_read_csv(sc,
"flights_spark",
path = "/path/flights.csv",
columns = list("Year", "Month", ..., "LateAircraftDelay"),
infer_schema = FALSE)
That said, with schema inference disabled you should be able to skip the columns argument completely, without a significant performance penalty:
spark_read_csv(sc,
"flights_spark",
path = "/path/flights.csv",
infer_schema = FALSE)
In the general case (mixed types), a named list does the trick:
names_ <- c("Year", "Month", ..., "LateAircraftDelay")
dtypes <- list("integer", "integer", ..., "string")
spark_read_csv(sc,
"flights_spark",
path = "/path/flights.csv",
columns = setNames(dtypes, names_),
infer_schema = FALSE)
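When every column shares one type, the named list itself can also be generated rather than written out by hand, which answers the original "less painful" question. A minimal sketch in plain R; the short col_names vector here is only an illustration, standing in for the full set of column names:

```r
# Hypothetical short vector of column names, standing in for the full set.
col_names <- c("Year", "Month", "DayofMonth", "DayOfWeek")

# Repeat the type once per column, then attach the names.
cols <- setNames(as.list(rep("character", length(col_names))), col_names)

# cols is now list(Year = "character", Month = "character", ...)
# and can be passed directly as the columns argument:
# spark_read_csv(sc, "flights_spark", path = "/path/flights.csv",
#                columns = cols, infer_schema = FALSE)
```

For mixed types, replace rep("character", ...) with a vector of the desired types in column order, exactly as in the setNames(dtypes, names_) example above.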