I am working on a group project for an exam in which we have to analyse a dataset with 7 million rows and more than 30 columns. We have been learning sparklyr since last month.
We do not have a cluster available, and our analysis code has to run on different machines with different setups and operating systems.
After importing the dataset we apply some transformations, create new variables and drop others. To keep the data as light as possible in memory, I have been using sdf_register() for the intermediate results, and compute() once the manipulation of the dataset is finished, to put the Spark table into the Spark cache before running some linear models.
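To be explicit about the pattern I mean, here is a simplified sketch (the toy data and table names are only placeholders, not the real dataset):
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = 'local')
toy <- sdf_copy_to(sc, data.frame(x = 1:10), name = "toy_flights")
# sdf_register() only registers a lazy temporary view; nothing is materialised yet
intermediate <- toy %>%
  mutate(y = x * 2) %>%
  sdf_register("intermediate_view")
# compute() forces the query to run and caches the result in Spark memory
final <- intermediate %>%
  filter(y > 4) %>%
  compute("final_cached")
The actual code is the following: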
library(sparklyr)
library(dplyr)

# spark connection ####
Sys.setenv(JAVA_HOME = system("/usr/libexec/java_home -V", intern = TRUE))
connection <- spark_connect(master = 'local')

# import airline dataset ####
data.folder <- "/airlinebefore2009"
system.time(complete.data <- spark_read_csv(
  path = data.folder, sc = connection, name = "airlineDataset"))

complete.data %>%
  group_by(ORIGIN, DEST) %>%
  # speed in km/h: DISTANCE is in miles, AIR_TIME in minutes
  mutate(speed = (DISTANCE * 1.60934) / (AIR_TIME / 60)) %>%
  filter(CANCELLED == 0 & DIVERTED == 0 & speed > 300 & speed < 990) %>%
  na.replace(LATE_AIRCRAFT_DELAY = 0, SECURITY_DELAY = 0, NAS_DELAY = 0,
             WEATHER_DELAY = 0, CARRIER_DELAY = 0) %>%
  select(ORIGIN, DEST, DEP_DELAY, DEP_DEL15, TAXI_OUT, TAXI_IN,
         ARR_DELAY, ARR_DEL15, CANCELLED, DIVERTED, CRS_ELAPSED_TIME,
         AIR_TIME, DISTANCE, CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY,
         SECURITY_DELAY, LATE_AIRCRAFT_DELAY) %>%
  sdf_register('airlinedataset')
... some transformations similar to the next code block ...
no.issue.flights <- tbl(connection, 'airlinedataset') %>%
  group_by(ORIGIN, DEST) %>%
  mutate(percentile = ifelse(DEP_DEL15 == 0 & ARR_DEL15 == 0,
                             percent_rank(AIR_TIME), NA),
         airtimeNA = ifelse(percentile > 0.9 | percentile < 0.1, NA, AIR_TIME),
         mean_air_time = mean(airtimeNA, na.rm = TRUE),
         new_air_time = AIR_TIME - mean_air_time,
         frequenza = ifelse(DEP_DEL15 == 0 & ARR_DEL15 == 0, 1, 0),
         sum_frequenza = sum(frequenza, na.rm = TRUE)) %>%
  filter(sum_frequenza > 5) %>%
  ungroup() %>%
  select(ARR_DEL15, DEP_DEL15, new_taxi_out, new_taxi_in, new_air_time,
         DEP_DELAY, ARR_DELAY, WEATHER_DELAY, NAS_DELAY, CARRIER_DELAY,
         SECURITY_DELAY, LATE_AIRCRAFT_DELAY) %>%
  compute('onlygoodflights')
Because I put no.issue.flights into the Spark cache with compute(), I sometimes get java.lang.OutOfMemoryError: Java heap space.
However, if I run the following part of the code without putting no.issue.flights into the cache, I get the same java.lang.OutOfMemoryError: Java heap space error, or an R session aborted message, while running the following linear models:
# 1st model ####
del1arr0 <- no.issue.flights %>%
  filter(DEP_DELAY > 16 & ARR_DELAY < 16)
partitions <- del1arr0 %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
training_set <- partitions$training
test_set <- partitions$test
lm_model <- training_set %>%
  ml_linear_regression(ARR_DELAY ~ DEP_DELAY + new_taxi_out + new_taxi_in + new_air_time)

# 2nd model ####
del1arr1 <- no.issue.flights %>%
  filter(DEP_DELAY > 15 & ARR_DELAY > 15)
partitions <- del1arr1 %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
training_set <- partitions$training
test_set <- partitions$test
lm_model <- training_set %>%
  ml_linear_regression(ARR_DELAY ~ DEP_DELAY + new_taxi_out + new_taxi_in + new_air_time)
Even when the dataset part of the code runs without errors, I still sometimes get the java.lang.OutOfMemoryError: Java heap space error after running the first or the second model. I do not understand why this happens.
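In case it is relevant: between the two models I assume the cached table could be dropped to free Spark memory, roughly like this (I have not verified that this is the right way, or that it helps):
# drop the computed table from the Spark cache once the first model is fitted
tbl_uncache(connection, 'onlygoodflights')
# the Spark web UI should show what is cached and how much memory it takes
spark_web(connection)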
My machine is a 2010 MacBook Pro with 8 GB of RAM and a 500 GB SSD, and I am using the default spark_config() values shown below; after them I also sketch how I believe the driver memory could be raised, in case that is relevant.
spark_config()
$sparklyr.connect.csv.embedded
[1] "^1.*"
$spark.sql.catalogImplementation
[1] "hive"
$sparklyr.connect.cores.local
[1] 4
$spark.sql.shuffle.partitions.local
[1] 4
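If the problem is simply that the local driver JVM does not get enough heap, my understanding is that it could be raised through the config before connecting, roughly like this (the 4g value is only an example, I have not verified that it fixes the error):
library(sparklyr)
config <- spark_config()
# give the local driver JVM more heap; adjust the value to the machine
config[["sparklyr.shell.driver-memory"]] <- "4g"
connection <- spark_connect(master = 'local', config = config)
Is this the right direction, or is the problem in how I am using sdf_register() and compute()?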