How do I merge two CSV files without duplicating a column?

Time: 2019-12-17 14:21:38

Tags: scala apache-spark-sql

I have two CSV files: one contains (UserId, MovieId, Rating) and the other contains (MovieId, title, genres). I want to merge them into a single file without a duplicated MovieId column.

    import sqlContext.implicits._
    import sqlContext._

    case class DataClass(UserId: Int, MovieId: Int, ratings: Double)
    val Data = sc.textFile("file:///usr/local/spark/dataset/rating.csv")
      .map(_.split(","))
      .map(p => DataClass(p(0).trim.toInt, p(1).trim.toInt, p(2).trim.toDouble))
      .toDF()

    case class DataClass2(MovieId: Int, title: String, genres: String)
    val Data2 = sc.textFile("file:///usr/local/spark/dataset/movieupdate")
      .map(_.split(","))
      .map(p => DataClass2(p(0).trim.toInt, p(1).trim, p(2).trim))
      .toDF()

    val merged = Data2.union(Data)
    merged.rdd
      .map(_.toSeq.map(_ + "").reduce(_ + "," + _))
      .saveAsTextFile("/usr/local/spark/dataset/merged")

How do I merge them correctly into UserId, MovieId, ratings, title, genres?

2 answers:

Answer 0: (score: 0)

I think you can join the two DataFrames on the MovieId field, select the columns you need, and then call distinct() on the result.

val merged = Data2.join(Data, Data2("MovieId") === Data("MovieId"), "left").select(Data("UserId"), Data2("MovieId"), Data("ratings"), Data2("title"), Data2("genres")).distinct()
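
Whether the joined result ends up with one MovieId column or two depends on how the join condition is written. The following is a minimal sketch of that difference, assuming a local SparkSession and made-up sample rows; the names `ratings`, `movies`, `withDuplicate`, and `withoutDuplicate` are illustrative and do not come from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the two CSV files.
val ratings = Seq((101, 1, 5.0), (102, 1, 4.0), (103, 2, 3.0))
  .toDF("UserId", "MovieId", "ratings")
val movies = Seq((1, "Movie Title 1", "Action"), (2, "Movie Title 2", "Sci Fi"))
  .toDF("MovieId", "title", "genres")

// Joining on an expression keeps the MovieId column from both sides.
val withDuplicate =
  movies.join(ratings, movies("MovieId") === ratings("MovieId"), "left")

// Joining on a column name keeps a single MovieId column.
val withoutDuplicate = movies.join(ratings, Seq("MovieId"), "left")

println(withDuplicate.columns.mkString(","))
// MovieId,title,genres,UserId,MovieId,ratings
println(withoutDuplicate.columns.mkString(","))
// MovieId,title,genres,UserId,ratings
```

With the expression form, the duplicate has to be removed afterwards, either by selecting columns through their parent DataFrame or by dropping one side's MovieId. The column-name form avoids that extra step.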

Answer 1: (score: 0)

I believe you want to join the two datasets. A simple left outer join will do.

Here is a snippet that you can adapt to read from text files.

case class DataClass(UserId: Int, MovieId: Int, ratings: Double)
case class DataClass2(MovieId: Int, title: String, genres: String)

// Sample ratings: (UserId, MovieId, ratings)
val Data = spark.createDataFrame(Seq(
  DataClass(101, 1, 5),
  DataClass(102, 1, 4),
  DataClass(103, 2, 3)))

// Sample movies: (MovieId, title, genres)
val Data2 = spark.createDataFrame(Seq(
  DataClass2(1, "Movie Title 1", "Action"),
  DataClass2(2, "Movie Title 2", "Sci Fi")))

// Joining on the column name keeps a single MovieId column in the result.
val MergedData = Data.join(Data2, Seq("MovieId"), "left_outer")

Data.show(3,false)

+------+-------+-------+
|UserId|MovieId|ratings|
+------+-------+-------+
|101   |1      |5.0    |
|102   |1      |4.0    |
|103   |2      |3.0    |
+------+-------+-------+

Data2.show(2,false)

+-------+-------------+------+
|MovieId|title        |genres|
+-------+-------------+------+
|1      |Movie Title 1|Action|
|2      |Movie Title 2|Sci Fi|
+-------+-------------+------+

The joined dataset will be:

MergedData.show(20,false)

+-------+------+-------+-------------+------+
|MovieId|UserId|ratings|title        |genres|
+-------+------+-------+-------------+------+
|1      |101   |5.0    |Movie Title 1|Action|
|1      |102   |4.0    |Movie Title 1|Action|
|2      |103   |3.0    |Movie Title 2|Sci Fi|
+-------+------+-------+-------------+------+
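
The snippet above builds its DataFrames in memory. The following is a rough sketch of how the same join could be applied to CSV files and written back out in the column order the question asks for. It assumes that neither file has a header row and that the columns appear in the order described in the question. The helper `mergeCsv` and the names `ratingSchema` and `movieSchema` are hypothetical, introduced here only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Assumption: a local session; inside spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("merge-csv-sketch")
  .getOrCreate()

// Assumption: no header rows, so the schemas are declared explicitly.
val ratingSchema = StructType(Seq(
  StructField("UserId", IntegerType),
  StructField("MovieId", IntegerType),
  StructField("ratings", DoubleType)))
val movieSchema = StructType(Seq(
  StructField("MovieId", IntegerType),
  StructField("title", StringType),
  StructField("genres", StringType)))

// Hypothetical helper: reads both files, joins them on MovieId,
// writes the result as CSV to outPath, and returns the merged DataFrame.
def mergeCsv(spark: SparkSession,
             ratingsPath: String,
             moviesPath: String,
             outPath: String): DataFrame = {
  val ratings = spark.read.schema(ratingSchema).csv(ratingsPath)
  val movies = spark.read.schema(movieSchema).csv(moviesPath)

  // Join on the column name so MovieId appears once, then fix the column order.
  val merged = ratings
    .join(movies, Seq("MovieId"), "left_outer")
    .select("UserId", "MovieId", "ratings", "title", "genres")

  // Fails if outPath already exists (Spark's default save mode).
  merged.write.csv(outPath)
  merged
}
```

With the paths from the question, the call would look like `mergeCsv(spark, "file:///usr/local/spark/dataset/rating.csv", "file:///usr/local/spark/dataset/movieupdate", "file:///usr/local/spark/dataset/merged")`. Unlike `split(",")`, Spark's CSV reader handles quoted fields, which matters if a movie title contains a comma.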