Spark 1.1: Saving an RDD to HDFS with saveAsTextFile

Asked: 2014-11-18 20:17:24

Tags: eclipse scala hadoop hdfs apache-spark

I am getting the following error

Exception in thread "main" java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/linkage/out1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:180)

when launching the following command:

spark-submit --class spark00.DataAnalysis1 --master local sproject1.jar linkage linkage/out1

The last two arguments (linkage and linkage/out1) are HDFS directories: the first contains several CSV files, and the second does not exist yet, as I assume it will be created automatically.
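
Just to make that assumption explicit, here is a minimal sketch of how both paths could be verified with the Hadoop FileSystem API before running the job (checkPaths is a hypothetical helper, not part of my submitted program):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical pre-flight check: the input directory must exist, and
// saveAsTextFile() refuses to write to an output path that already exists.
def checkPaths(input: String, output: String): Unit = {
  val fs = FileSystem.get(new Configuration()) // default FS from the Hadoop config
  require(fs.exists(new Path(input)), s"input path $input not found")
  require(!fs.exists(new Path(output)), s"output path $output already exists")
}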

The following code was tested successfully in the REPL (Spark 1.1, Scala 2.10.4), except of course for the saveAsTextFile() part. I followed the step-by-step approach described in the O'Reilly book "Advanced Analytics with Spark".

Since it works in the REPL, I wanted to turn it into a JAR file using Eclipse Juno, with the following code.

package spark00

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object DataAnalysis1 {

  case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

  // The header line is the one containing the field name "id_1".
  def isHeader(line: String) = line.contains("id_1")

  // "?" marks a missing value in the CSV data.
  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }

  // Parse one CSV line into a MatchData: two integer ids,
  // nine double-valued scores, and a boolean matched flag.
  def parse(line: String) = {
    val pieces = line.split(",")
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble)
    val matched = pieces(11).toBoolean
    MatchData(id1, id2, scores, matched)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("DataAnalysis1")
    val sc = new SparkContext(conf)
    // Load our input data.
    val rawblocks = sc.textFile(args(0))

    // CLEAN-UP
    // a. calling !isHeader(): suppress header
    val noheader = rawblocks.filter(x => !isHeader(x))
    // b. calling parse(): setting feature types and renaming headers
    val parsed = noheader.map(line => parse(line))

    // EXPORT CLEAN FILE
    parsed.coalesce(1, true).saveAsTextFile(args(1))
  }

}

As you can see, args(0) should be the "linkage" directory, and args(1) is actually the output HDFS directory linkage/out1 from my spark-submit command above.

I also tried the last line without coalesce(1,true).
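
For reference, these are the variants of that last line I am talking about (a sketch only; as far as I understand, repartition(1) is equivalent to coalesce(1, true), i.e. coalesce with shuffling enabled):

parsed.saveAsTextFile(args(1))                   // one part-XXXXX file per partition
parsed.coalesce(1, true).saveAsTextFile(args(1)) // shuffle down to a single partition
parsed.repartition(1).saveAsTextFile(args(1))    // same as coalesce(1, shuffle = true)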

Here is the official RDD type of parsed, as reported by the REPL:

parsed: org.apache.spark.rdd.RDD[(Int, Int, Array[Double], Boolean)] = MappedRDD[3] at map at <console>:34
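
Note that in the JAR version parsed is an RDD[MatchData] rather than an RDD of tuples, so saveAsTextFile() would write each element's toString(). If the output needs to stay in CSV form, something like this sketch could be applied before saving (csvLines is a hypothetical name, not part of my code):

// Hypothetical: serialize each MatchData back into a CSV line before writing.
val csvLines = parsed.map { md =>
  (Seq(md.id1, md.id2) ++ md.scores :+ md.matched).mkString(",")
}
csvLines.saveAsTextFile(args(1))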

Thank you in advance for your support.

November 20th: I am adding the simple WordCount code below, which works well when run with the spark-submit command in the same way as the code above. So my question is: why does saveAsTextFile() work for this code and not for the other one?

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SpWordCount {
    def main(args: Array[String]) {
        // Create a Scala Spark Context.
        val conf = new SparkConf().setMaster("local").setAppName("wordCount")
        val sc = new SparkContext(conf)
        // Load our input data.
        val input =  sc.textFile(args(0))
        // Split it up into words.
        val words = input.flatMap(line => line.split(" "))
        // Transform into word and count.
        val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
        // Save the word count back out to a text file, causing evaluation.
        counts.saveAsTextFile(args(1))
    }
}
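
It was launched in the same way as above, i.e. presumably something like this (assuming SpWordCount also sits in the spark00 package and is built into the same JAR; the input and output placeholders stand for existing and non-existing HDFS paths respectively):

spark-submit --class spark00.SpWordCount --master local sproject1.jar <input> <output>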

0 Answers:

No answers yet