Spark error "Output directory already exists"

Date: 2017-02-06 13:24:00

Tags: windows scala apache-spark

I ran a simple example (Spark, Windows 7) and got an unexpected FileAlreadyExistsException. I cannot find that folder or file anywhere on my computer.

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/PluralsightData/ReadMeWordCountViaApp already exists
        at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1191)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)

package main

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._

object WordCounter {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Word Counter")
        val sc = new SparkContext(conf)
        //val textFile = sc.textFile("file:///Spark/README.md")
        val textFile = sc.textFile("file:///README.md")
        // Split each line into words
        val tokenizedFileData = textFile.flatMap(line => line.split(" "))
        // Pair every word with an initial count of 1
        val countPrep = tokenizedFileData.map(word => (word, 1))
        // Sum the counts per word
        val counts = countPrep.reduceByKey((accumValue, newValue) => accumValue + newValue)
        // Sort by count, descending
        val sortedCounts = counts.sortBy(kvPair => kvPair._2, ascending = false)
        // Fails with FileAlreadyExistsException when the directory already exists
        sortedCounts.saveAsTextFile("file:///PluralsightData/ReadMeWordCountViaApp")
    }
}

The source of the sample is available at https://github.com/constructor-igor/TechSugar/tree/master/ScalaSamples/WordCounterSample
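
For what it's worth, on Windows a drive-less URI such as file:///PluralsightData/... is resolved against the root of the drive the JVM is running from, so the output directory most likely does exist, e.g. at C:\PluralsightData. A minimal check of where it actually lands (the path is taken from the stack trace above):

import java.io.File

object WhereIsOutput {
    def main(args: Array[String]): Unit = {
        // A path starting with "/" is resolved against the current drive's
        // root on Windows, so this prints where file:///PluralsightData/... landed.
        val dir = new File("/PluralsightData/ReadMeWordCountViaApp")
        println(dir.getAbsolutePath) // e.g. C:\PluralsightData\ReadMeWordCountViaApp
        println(dir.exists())
    }
}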

1 Answer:

Answer 0 (score: 0)

Based on the comments:

  1. Spark deliberately refuses to overwrite existing data: saveAsTextFile fails with FileAlreadyExistsException if the target directory already exists, so remove the previous output before rerunning (see the sketch after the snippet below).

  2. An absolute path for the target directory makes the result data easy to locate on the local disk; the drive-less URI file:///PluralsightData/... put the output on the root of whatever drive the job ran from.

    sortedCounts.saveAsTextFile("file:///C:/Temp/ReadMeWordCountViaApp")
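
To make the job rerunnable, the previous output can be removed through Hadoop's FileSystem API before saving. A minimal sketch, assuming the sortedCounts RDD and SparkContext sc from the question (the C:/Temp path is the one suggested above):

import org.apache.hadoop.fs.{FileSystem, Path}

// Delete the output of a previous run (recursive = true) so that
// saveAsTextFile does not hit FileAlreadyExistsException.
val outputPath = new Path("file:///C:/Temp/ReadMeWordCountViaApp")
val fs = FileSystem.get(outputPath.toUri, sc.hadoopConfiguration)
if (fs.exists(outputPath)) fs.delete(outputPath, true)
sortedCounts.saveAsTextFile(outputPath.toString)

An explicit delete keeps Spark's no-overwrite safety for every other path while making this one job safe to rerun.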