Merging large datasets with SparkR

Date: 2016-01-12 02:30:16

Tags: r apache-spark sparkr

I would like to know whether SparkR makes it easier to merge large datasets than "regular R" does. I have 12 CSV files, each roughly 500,000 rows by 40 columns. The files are monthly data for 2014, and I want to produce one file for all of 2014. They all have the same column labels, and I want to merge on the first column (year). However, some files have more rows than others.

When I run the following code:

library(SparkR)
library(magrittr)
# setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014\\Jan2014.csv")
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

Jan2014_file_path <- file.path( 'Jan2014.csv')

system.time(
housing_a_df <- read.df(sqlContext, 
                      "C:\\Users\\Anonymous\\Desktop\\Data       2014\\Jan2014.csv", 
                      header='true',  
                      inferSchema='false')
)

R crashed.

When I ran the code, I received the following error:

   Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):

So what would be a straightforward way to merge these files in SparkR?

2 Answers:

Answer 0 (score: 0)

You should read the CSV files in this format (reference: https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85):

# Launch SparkR using 
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3

# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1

# Load the local CSV file using `read.df`. Note that we use the CSV reader Spark package here.
Jan2014 <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Jan2014.csv", "com.databricks.spark.csv", header="true")

Feb2014 <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Feb2014.csv", "com.databricks.spark.csv", header="true")

#For merging / joining by year

#join
jan_feb_2014 <- join(Jan2014, Feb2014, joinExpr = Jan2014$year == Feb2014$year1, joinType = "left_outer")
# "left_outer" keeps all columns of Jan2014 plus the matching columns of Feb2014; change the join type to suit your requirement.
# Rename the Feb2014 column "year" to "year1" before joining, since it would otherwise be duplicated; after the join you can drop it with jan_feb_2014$year1 <- NULL.

This is how you can join the files one by one.
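
For example, a minimal sketch of the rename-and-drop step described in the comments above (it assumes Feb2014 also has a "year" column, as in the question):

    # Rename Feb2014's "year" so the join does not produce a duplicate column name
    Feb2014 <- withColumnRenamed(Feb2014, "year", "year1")
    jan_feb_2014 <- join(Jan2014, Feb2014,
                         joinExpr = Jan2014$year == Feb2014$year1,
                         joinType = "left_outer")
    # Drop the duplicated join key after the join
    jan_feb_2014$year1 <- NULL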

Answer 1 (score: 0)

Once you have read the files as DataFrames, you can use unionAll in SparkR to combine them into a single DataFrame, and then write the result out as one CSV file.

Example code:

    df1 <- read.df(sqlContext, "/home/user/tmp/test1.csv", source = "com.databricks.spark.csv")
    df2 <- read.df(sqlContext, "/home/user/tmp/test2.csv", source = "com.databricks.spark.csv")
    mergedDF <- unionAll(df1, df2)
    write.df(mergedDF, "merged.csv", "com.databricks.spark.csv", "overwrite")

I have tested and used this, though not on data of your size, but I hope it helps.
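
If all twelve monthly files follow the same naming pattern, the same idea extends to a loop. This is only a sketch: the file names and the single-partition write are assumptions, and note that write.df produces a directory of part files rather than one plain CSV file:

    # Build the list of monthly file paths (names assumed to follow the Jan2014.csv pattern)
    months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
    paths <- file.path("C:/Users/Anonymous/Desktop/Data 2014", paste0(months, "2014.csv"))
    # Read each file and union them all into one DataFrame
    dfs <- lapply(paths, function(p)
      read.df(sqlContext, p, source = "com.databricks.spark.csv", header = "true"))
    all2014 <- Reduce(unionAll, dfs)
    # Repartition to a single partition so the output directory contains one part file
    write.df(repartition(all2014, 1L), "All2014.csv", "com.databricks.spark.csv", "overwrite")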