I am going to apply linear regression to a dataset. It works fine when I apply a subset of the data in *.txt format, as shown below:
// how could I read 26 *.tar.gz compressed files into a DataFrame?
import sqlContext.implicits._
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}

val inputpath = "/Users/jasonzhu/Downloads/a.txt"
val rawDF = sc.textFile(inputpath).toDF()
val df = se.kth.spark.lab1.task2.Main.body(sqlContext, rawDF)
// split into training (95%) and test (5%) sets
val splitDf = df.randomSplit(Array(0.95, 0.05), seed = 42L)
val (obsDF, testDF) = (splitDf(0).cache(), splitDf(1))
// hyperparameters for the regression
val maxIter = 6
val regParam = 0.07
val elasticNetParam = 0.1
println(s"maxIter=${maxIter}, regParam=${regParam}, elasticNetParam=${elasticNetParam}")
val myLR = new LinearRegression()
.setMaxIter(maxIter)
.setRegParam(regParam)
.setElasticNetParam(elasticNetParam)
val lrStage = 0 // index of the LinearRegression stage in the pipeline
val pipeline = new Pipeline().setStages(Array(myLR))
val pipelineModel: PipelineModel = pipeline.fit(obsDF)
val lrModel = pipelineModel.stages(lrStage).asInstanceOf[LinearRegressionModel]
val trainingSummary = lrModel.summary
// print the training metrics (RMSE and r2) of our model
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
// predict on the test set and show the first few rows
val predictedDF = pipelineModel.transform(testDF)
predictedDF.show(5, false)
After this sanity check, I am going to apply the whole dataset, which resides in 26 *.tar.gz files, to the linear regression model. I'd like to know how I should read these compressed files into a Spark DataFrame and consume them efficiently by taking advantage of the parallelism in Spark. Thanks!
Answer 0 (score: 1)
The textFile() method also accepts wildcards. From the documentation:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
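A minimal sketch of the wildcard approach, reusing the Downloads directory from the question (the exact path is an assumption). One caveat: Spark decompresses the gzip layer transparently, but it does not unpack the tar archive inside, so tar header blocks can show up as extra lines in the input; gzip files are also non-splittable, so each archive maps to a single partition.

import sqlContext.implicits._

// one call reads all 26 archives; the wildcard expands to every matching file
val rawDF = sc.textFile("/Users/jasonzhu/Downloads/*.tar.gz").toDF()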
Answer 1 (score: 0)
Start from an empty RDD and run a loop that reads each file as an RDD, merging it into the accumulated RDD with a union operation on every iteration, as sketched below.
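A minimal sketch of that loop, assuming a hypothetical part1.tar.gz ... part26.tar.gz naming scheme (adjust the paths to the real file names):

import org.apache.spark.rdd.RDD
import sqlContext.implicits._

// hypothetical paths for the 26 archives
val paths = (1 to 26).map(i => s"/Users/jasonzhu/Downloads/part$i.tar.gz")

var combined: RDD[String] = sc.emptyRDD[String]
for (path <- paths) {
  combined = combined.union(sc.textFile(path)) // append this file's lines
}
val rawDF = combined.toDF()

Note that chaining many unions builds up a long RDD lineage; sc.union(paths.map(p => sc.textFile(p))) merges all of them in a single call, and the wildcard approach from the first answer avoids the loop entirely.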