I'm getting the following error:
Exception in thread "main" java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/linkage/out1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:180)
when launching the following command:
spark-submit --class spark00.DataAnalysis1 --master local sproject1.jar linkage linkage/out1
The last two arguments (linkage and linkage/out1) are HDFS directories: the first contains several CSV files, and the second does not exist yet; I assumed it would be created automatically.
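To make explicit what I mean by "created automatically", here is a minimal sketch (assuming the standard Hadoop FileSystem API; it is not part of my actual job, and the path is just the one from the command above) that checks whether the output directory already exists before saving:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch, not taken from my job: check whether the output
    // directory already exists before calling saveAsTextFile().
    // The path "linkage/out1" is only the example from my command above.
    object OutputPathCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("OutputPathCheck"))
        val fs = FileSystem.get(sc.hadoopConfiguration)
        val out = new Path("linkage/out1")
        if (fs.exists(out))
          println("Output path " + out + " already exists")
        else
          println("Output path " + out + " does not exist yet; saveAsTextFile() should create it")
        sc.stop()
      }
    }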
The following code was tested successfully with the REPL (Spark 1.1, Scala 2.10.4), except of course for the saveAsTextFile() part. I followed the step-by-step approach described in the O'Reilly book "Advanced Analytics with Spark".
Since it works in the REPL, I wanted to convert it into a JAR file with Eclipse Juno, using the following code.
package spark00

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object DataAnalysis1 {

  case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

  def isHeader(line: String) = line.contains("id_1")

  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }

  def parse(line: String) = {
    val pieces = line.split(",")
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble)
    val matched = pieces(11).toBoolean
    MatchData(id1, id2, scores, matched)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("DataAnalysis1")
    val sc = new SparkContext(conf)

    // Load our input data.
    val rawblocks = sc.textFile(args(0))

    // CLEAN-UP
    // a. calling !isHeader(): suppress header
    val noheader = rawblocks.filter(!isHeader(_))
    // b. calling parse(): setting feature types and renaming headers
    val parsed = noheader.map(line => parse(line))

    // EXPORT CLEAN FILE
    parsed.coalesce(1, true).saveAsTextFile(args(1))
  }
}
As you can see, args(0) should be the "linkage" directory, and args(1) is the output HDFS directory, i.e. linkage/out1 from my spark-submit command.
I also tried the last line without coalesce(1,true). Here is parsed:
parsed: org.apache.spark.rdd.RDD[(Int, Int, Array[Double], Boolean)] = MappedRDD[3] at map at <console>:34
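Concretely, the variant I tried simply drops the repartitioning, along these lines:

    // Same as the last line of main(), just without merging into a single part file
    parsed.saveAsTextFile(args(1))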
Thanks in advance for your support.
November 20: I'm adding below the simple WordCount code, which works fine when I run the spark-submit command in the same way as for the code above. So my question is: why does saveAsTextFile() work for this code and not for the other one?
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SpWordCount {
  def main(args: Array[String]) {
    // Create a Scala Spark Context.
    val conf = new SparkConf().setMaster("local").setAppName("wordCount")
    val sc = new SparkContext(conf)
    // Load our input data.
    val input = sc.textFile(args(0))
    // Split it up into words.
    val words = input.flatMap(line => line.split(" "))
    // Transform into word and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // Save the word count back out to a text file, causing evaluation.
    counts.saveAsTextFile(args(1))
  }
}
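I launched it the same way, roughly like this, where the package, jar name, and the two HDFS directories are placeholders standing in for the ones I actually used:

    spark-submit --class spark00.SpWordCount --master local sproject1.jar <inputDir> <outputDir>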