Question

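I am learning how to read and write files in HDFS using Spark / Scala. I cannot write to an HDFS file: the file is created, but it is empty. I do not know how to write a loop that writes into the file.

The code is: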

import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Read the adult CSV file
  val logFile = "hdfs://zobbi01:9000/input/adult.csv"
  val conf = new SparkConf().setAppName("Simple Application")
  val sc = new SparkContext(conf)
  val logData = sc.textFile(logFile, 2).cache()


  //val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
  val headerAndRows = logData.map(line => line.split(",").map(_.trim))
  val header = headerAndRows.first
  val data = headerAndRows.filter(_(0) != header(0))
  val maps = data.map(splits => header.zip(splits).toMap)
  val result = maps.filter(map => map("AGE") != "23")

  result.foreach{

      result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
  }

If I replace it with: result.foreach{println}

then it works!

But when I use the saveAsTextFile method, an error message is thrown:

<console>:76: error: type mismatch;
 found   : Unit
 required: scala.collection.immutable.Map[String,String] => Unit
             result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")

Please help.

Answer 1

result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")

That is all you need to do. You do not need to loop over the rows.

Hope this helps!

Answer 2

What is this supposed to do?!

 result.foreach{
  result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
 }

Unless a special configuration is set, RDD operations cannot be triggered from inside another RDD operation (here, the foreach action).
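The type mismatch reported in the question can be reproduced without Spark, since foreach on an ordinary Scala collection has the same shape: it expects a function applied to each element, not a block that evaluates to Unit. The following is only a minimal sketch; the object name ForeachDemo, the helper collectAges, and the sample rows are made up for illustration.

```scala
import scala.collection.mutable.ListBuffer

object ForeachDemo {
  // Hypothetical helper: gathers the AGE value of every row by passing
  // a function (row => ...) to foreach, which is what foreach expects.
  def collectAges(rows: List[Map[String, String]]): List[String] = {
    val seen = ListBuffer.empty[String]
    rows.foreach(row => seen += row("AGE"))
    seen.toList
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows standing in for the parsed CSV records.
    val rows = List(Map("AGE" -> "39"), Map("AGE" -> "50"))

    // Compiles: the method println is converted to a function of one argument.
    rows.foreach(println)

    // Does not compile if uncommented: the block evaluates println("x")
    // once and yields Unit, so there is no function to hand to foreach.
    // rows.foreach { println("x") }

    println(collectAges(rows))
  }
}
```

This is why result.foreach{println} is accepted while a block containing the saveAsTextFile call is rejected: saveAsTextFile returns Unit, and Unit is not a function from a row to a result.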

Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.

If you need a different format in the output file, transform the RDD before writing it.
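As a rough illustration of changing the format before writing, each row (a Map from column name to value) could be rendered as a comma-separated line. This is a sketch under assumptions: the object FormatRows and the helper toCsvLine are hypothetical, and apart from AGE the column name and the values are invented sample data.

```scala
object FormatRows {
  // Hypothetical helper: renders one row (column name -> value) as a CSV
  // line, emitting values in the order given by `header`. Columns missing
  // from the row become empty fields.
  def toCsvLine(header: Seq[String], row: Map[String, String]): String =
    header.map(col => row.getOrElse(col, "")).mkString(",")

  def main(args: Array[String]): Unit = {
    // Made-up sample data; only the AGE column name comes from the question.
    val header = Seq("AGE", "WORKCLASS")
    val row = Map("WORKCLASS" -> "Private", "AGE" -> "39")
    println(toCsvLine(header, row)) // prints "39,Private"
  }
}
```

With the RDD from the question, the helper would presumably be applied as result.map(row => FormatRows.toCsvLine(header.toSeq, row)).saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt"). Note that saveAsTextFile writes a directory of part files at that path rather than a single file.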

Writing to an HDFS file iteratively with Spark / Scala
