I would like to know whether SparkR makes it easier to merge large datasets than "regular" R. I have 12 csv files, each roughly 500,000 rows by 40 columns; they are monthly data for 2014, and I want to produce a single file for the whole year. The files all have the same column labels, and I want to merge on the first column (year). However, some files have more rows than others.
When I run the following code:
library(SparkR)
library(magrittr)
# setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014\\Jan2014.csv")
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
Jan2014_file_path <- file.path('Jan2014.csv')
system.time(
  housing_a_df <- read.df(sqlContext,
                          "C:\\Users\\Anonymous\\Desktop\\Data 2014\\Jan2014.csv",
                          header = 'true',
                          inferSchema = 'false')
)
R crashed.
When I ran this code, I received the following error:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
So what is a simple way to merge these files in SparkR?
Answer 0 (score: 0)
You should read the csv files in the following way. Reference: https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85
# Launch SparkR using
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the local CSV file using `read.df`. Note that we use the CSV reader Spark package here.
Jan2014 <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Jan2014.csv", "com.databricks.spark.csv", header="true")
Feb2014 <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Feb2014.csv", "com.databricks.spark.csv", header="true")
# For merging / joining by year.
# Note: rename Feb2014's "year" column to "year1" before joining, since the name
# would otherwise be duplicated; after the join you can drop the extra column
# with jan_feb_2014$year1 <- NULL.
jan_feb_2014 <- join(Jan2014, Feb2014, joinExpr = Jan2014$year == Feb2014$year1, joinType = "left_outer")
# I used "left_outer" to keep all columns of Jan2014 plus the matching columns of
# Feb2014; change the join type to suit your requirement.
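For illustration, here is a minimal sketch of that rename-then-drop flow, assuming the column really is named "year" in both files (withColumnRenamed is part of the SparkR API):

# Rename Feb2014's "year" column so it will not collide with Jan2014's after the join
Feb2014 <- withColumnRenamed(Feb2014, "year", "year1")
# Left outer join on the year columns
jan_feb_2014 <- join(Jan2014, Feb2014, joinExpr = Jan2014$year == Feb2014$year1, joinType = "left_outer")
# Drop the duplicated key column, as suggested above
jan_feb_2014$year1 <- NULL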
This is how you can join the files one by one.
Answer 1 (score: 0)
Once you have read the files in as data frames, you can use unionAll from SparkR to combine them into a single data frame, and then write that out as a single csv file.
Sample code:
# Read each csv file as a SparkR DataFrame using the spark-csv package
df1 <- read.df(sqlContext, "/home/user/tmp/test1.csv", source = "com.databricks.spark.csv")
df2 <- read.df(sqlContext, "/home/user/tmp/test2.csv", source = "com.databricks.spark.csv")
# Stack the two DataFrames (they must have the same columns)
mergedDF <- unionAll(df1, df2)
# Write the merged result back out as csv
write.df(mergedDF, "merged.csv", "com.databricks.spark.csv", "overwrite")
I have tested and used this, though not on data of your size. I hope it helps.
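If all 12 monthly files share the same columns, one possible way to extend this (just a sketch; the folder and file names below are assumptions, so adjust them to your actual layout) is to read every file into a list and fold the list together with unionAll:

months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Assumed file layout: one <Month>2014.csv per month in the same folder
paths <- file.path("C:/Users/Anonymous/Desktop/Data 2014", paste0(months, "2014.csv"))
# Read each monthly file as a SparkR DataFrame
dfs <- lapply(paths, function(p) {
  read.df(sqlContext, p, source = "com.databricks.spark.csv", header = "true")
})
# Stack all twelve months into one DataFrame
all_2014 <- Reduce(unionAll, dfs)
# Write the combined year out; note that Spark writes a directory of part files
write.df(all_2014, "C:/Users/Anonymous/Desktop/Data 2014/all_2014_output", "com.databricks.spark.csv", "overwrite")

unionAll simply appends rows, so this only works if the files have identical column names and order.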