Saving data into Hadoop with SparkR - crash

Asked: 2016-05-02 11:30:27

Tags: apache-spark sparkr

The first thing I tried was to load a 2 GB txt file into R and save it into Hadoop.

My laptop has 4 cores and 16 GB of RAM. RAM usage went like this:

2 GB RAM - Windows and other apps
8 GB RAM - after loading the data using read.csv
16 GB RAM - crashed when trying to save the data into Hadoop using `df = createDataFrame(sqlContext, dat)`

Does anyone know how to avoid running out of RAM in this case? Or is SparkR simply not the right tool for loading data and saving it into Hadoop? (I could also use other Hadoop tools or Python.) Thanks.

Code:

library(rJava)

# Point SparkR at the local Spark installation if SPARK_HOME is not already set
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
    Sys.setenv(SPARK_HOME = 'D:\\spark-1.6.1-bin-hadoop2.6')
}

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Local Spark using all cores and 4 GB of driver memory
sc = sparkR.init(master = "local[*]",
                 sparkEnvir = list(spark.driver.memory = '4g'))

sqlContext = sparkRSQL.init(sc)

setwd('D:\\data\\Medicare_Provider_Util_Payment_PUF_CY2013')

# Load the 2 GB tab-delimited file into an R data.frame
dat = read.csv('Medicare_Provider_Util_Payment_PUF_CY2013.txt', header = TRUE, sep = '\t', row.names = NULL)
head(dat)

# Convert the local data.frame into a Spark DataFrame; this is the step that crashes
df = createDataFrame(sqlContext, dat)

1 Answer:

Answer 0 (score: 0)

# Starting from the data.frame already loaded in the question:
dat = read.csv('Medicare_Provider_Util_Payment_PUF_CY2013.txt', header = TRUE, sep = '\t', row.names = NULL)


#Option 1:
#You can save it as a Hive table
library(magrittr)  # provides the %>% pipe used below
hiveContext <- sparkRHive.init(sc)
createDataFrame(hiveContext, dat) %>% saveAsTable("Hive_DataBase.HiveTable")

#Option 2:
#You can save it in Parquet format
df = createDataFrame(sqlContext, dat)
#Use an hdfs:// path (or set fs.defaultFS) if the files should land in HDFS rather than on the local disk
write.df(df, path = "df_1.parquet", source = "parquet", mode = "overwrite")
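
Both options above still go through createDataFrame(sqlContext, dat), which serializes the entire in-memory R data.frame and ships it to the JVM, so on their own they do not remove the memory pressure described in the question. A third option, sketched below, is to let Spark read the text file itself so it never has to exist as an R data.frame; this assumes the com.databricks:spark-csv package (not part of the original code) is added when the SparkR context is first created:

#Option 3 (a sketch, not from the original answer):
#Let Spark read the tab-delimited file directly instead of going through read.csv.
#Assumes the spark-csv package is available; the coordinates below are for a Scala 2.10 build of Spark 1.6.
sc = sparkR.init(master = "local[*]",
                 sparkEnvir = list(spark.driver.memory = '4g'),
                 sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
sqlContext = sparkRSQL.init(sc)

df = read.df(sqlContext,
             "D:\\data\\Medicare_Provider_Util_Payment_PUF_CY2013\\Medicare_Provider_Util_Payment_PUF_CY2013.txt",
             source = "com.databricks.spark.csv",
             header = "true", delimiter = "\t", inferSchema = "true")

#Write straight to Parquet; use an hdfs:// URI if the target is HDFS
write.df(df, path = "df_1.parquet", source = "parquet", mode = "overwrite")

Because the file is read by Spark rather than materialized with read.csv, the 2 GB of text never has to be held as an R object, which is what was exhausting the laptop's RAM.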