Saving a file without parentheses

Time: 2019-06-26 17:03:26

Tags: scala apache-spark dataframe rdd

I want my final output to contain no parentheses.

I tried the following, but it produces a lot of errors:

.map(x => x.mkString(",").saveAsTextFile("/home/amel/new")

Here is my code:

val x = sc.textFile("/home/amel/1MB")
  .filter(!_.contains("NULL"))
  .filter(!_.contains("Null"))
val re = x.map(row => {
  val cols = row.split(",")
  val Cycle = cols(2)
  val Duration = Cycle match {
    case "Licence" => "3 years"
    case "Master" => "2 years"
    case "Ingéniorat" => "5 years"
    case "Ingeniorat" => "5 years"
    case "Doctorat" => "3 years"
    case _ => "NULL"
  }
  (cols(1).split("-")(0) + "," + Cycle + "," + Duration + "," + cols(3), 1)
}).reduceByKey(_ + _)
re.collect.foreach(println)
}

This is the result I get:

(1999,2 years,Master,IC,57)

(2013,3 years,Doctorat,SI,44)

(2013,3 years,Licence,IC,73)

(2009,5 years,Ingeniorat,IC,58)

(2011,2 years,Master,SI,61)

(2003,5 years,Ingeniorat,IC,65)

(2019,3 years,Doctorat,SI,80)

What I want: remove the opening and closing parentheses.

2 answers:

Answer 0: (score: 4)

Instead of collecting and printing like this:

re.collect.foreach(println)

you can do something like this:

val x: Seq[(Int, String, String, String, Int)] = Seq((1999, "2 years", "Master", "IC", 57), (2013,"3 years","Doctorat","SI",44))
x.map(p => p.productIterator.mkString(",")).foreach(println)

Result:

1999,2 years,Master,IC,57
2013,3 years,Doctorat,SI,44
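Note that the reduced RDD in the question holds (String, Int) pairs rather than five-element tuples. Since productIterator is available on every tuple, the same call should work on that shape too. The following is only a minimal sketch in plain Scala: the object name, the helper toCsvLine, and the sample rows are hypothetical stand-ins for the output of re.collect.

```scala
object ProductIteratorDemo {
  // Joins all fields of any tuple (or case class) with commas.
  // toCsvLine is a hypothetical helper, not part of the original answer.
  def toCsvLine(p: Product): String = p.productIterator.mkString(",")

  def main(args: Array[String]): Unit = {
    // Made-up sample rows shaped like the question's (String, Int) pairs.
    val rows: Seq[(String, Int)] = Seq(
      ("1999,Master,2 years,IC", 57),
      ("2013,Doctorat,3 years,SI", 44)
    )
    // Each line is printed without the surrounding parentheses.
    rows.map(toCsvLine).foreach(println)
  }
}
```

Because the helper accepts any Product, it behaves the same whether the row is a pair or a wider tuple.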

Or, more simply, you can use a DataFrame to get this result:

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession

object TupleTest {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(this.getClass.getName).config("spark.master", "local").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq((1, "Spark"), (2, "Databricks"), (3, "Notebook")))
    val df = rdd.toDF("Id", "Name")
    df.coalesce(1).write.mode("overwrite").csv("./src/main/resouces/single")
  }

}

Result saved in the text file:

1,Spark
2,Databricks
3,Notebook

Answer 1: (score: 0)

The other answer does not take the structure of your data into account. You have a (String, Int) tuple, so you need to change this:

}).reduceByKey(_ + _)
re.collect.foreach(println)
}

to this:

}).reduceByKey(_ + _).map(x => x._1 + "," + x._2)
re.collect.foreach(println)
}
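Since the goal is to save a file rather than print, the mapped RDD can presumably be passed to saveAsTextFile instead of being collected. The sketch below assumes Spark is on the classpath and runs in local mode. The object name, the helper toLine, the sample pairs, and the temporary output directory are all hypothetical. With the question's own data, the last step would be applied to re and the path /home/amel/new.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SaveWithoutParens {
  // Turns one (key, count) pair into a plain comma-separated line.
  // toLine is a hypothetical helper mirroring the map in the answer above.
  def toLine(pair: (String, Int)): String = pair._1 + "," + pair._2

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SaveWithoutParens")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up stand-in for the (String, 1) pairs built in the question.
    val pairs = sc.parallelize(Seq(
      ("1999,Master,2 years,IC", 1),
      ("1999,Master,2 years,IC", 1),
      ("2013,Doctorat,3 years,SI", 1)
    ))

    // saveAsTextFile fails if the target directory already exists,
    // so this sketch writes to a fresh path under a temp directory.
    val out = Files.createTempDirectory("spark-out").resolve("result").toString
    pairs.reduceByKey(_ + _).map(toLine).saveAsTextFile(out)
    println(s"Wrote part files under $out")

    spark.stop()
  }
}
```

The output is a directory of part files, one per partition. Calling coalesce(1) before saving, as the DataFrame answer does, would yield a single part file.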