I have two CSV files: one contains (UserId, MovieId, Rating) and the other contains (MovieId, title, genres). I want to merge them into a single file without duplicating MovieId.
import sqlContext.implicits._
import sqlContext._
case class DataClass(UserId: Int, MovieId:Int, ratings: Double)
val Data = sc.textFile("file:///usr/local/spark/dataset/rating.csv").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim.toInt, p(2).trim.toDouble)).toDF()
case class DataClass2( MovieId:Int, title: String,genres:String)
val Data2 = sc.textFile("file:///usr/local/spark/dataset/movieupdate").map(_.split(",")).map(p => DataClass2(p(0).trim.toInt, p(1).trim, p(2).trim)).toDF()
val merged=Data2.union(Data)
merged.rdd
.map(_.toSeq.map(_+"").reduce(_+","+_))
.saveAsTextFile("/usr/local/spark/dataset/merged")
How do I correctly merge them into UserId, MovieId, ratings, title, genres?
Answer 0: (score: 0)
I think you can join them on the MovieId field, select the columns you need, and then call distinct() on the result.
val merged = Data2.join(Data, Data2("MovieId") === Data("MovieId"), "left").select(Data("UserId"), Data2("MovieId"), Data("ratings"), Data2("title"), Data2("genres")).distinct()
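Answer 1: (score: 0)
I believe you want to join the two datasets. A simple left outer join will do.
Here is a snippet that you can adapt to read from your text files.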
import org.apache.spark.sql.functions.col
case class DataClass(UserId: Int, MovieId:Int, ratings: Double)
case class DataClass2(MovieId:Int, title: String,genres:String)
val Data = spark.createDataFrame(
DataClass(101,1,5)
:: DataClass(102,1,4)
:: DataClass(103,2,3):: Nil)
val Data2 = spark.createDataFrame(
DataClass2(1,"Movie Title 1","Action")
:: DataClass2(2,"Movie Title 2","Sci Fi"):: Nil)
val MergedData = Data.join(Data2, Seq("MovieId"),"left_outer")
Data.show(3,false)
+------+-------+-------+
|UserId|MovieId|ratings|
+------+-------+-------+
|101   |1      |5.0    |
|102   |1      |4.0    |
|103   |2      |3.0    |
+------+-------+-------+
Data2.show(2,false)
+-------+-------------+------+
|MovieId|title        |genres|
+-------+-------------+------+
|1      |Movie Title 1|Action|
|2      |Movie Title 2|Sci Fi|
+-------+-------------+------+
The joined dataset will be:
MergedData.show(20,false)
+-------+------+-------+-------------+------+
|MovieId|UserId|ratings|title        |genres|
+-------+------+-------+-------------+------+
|1      |101   |5.0    |Movie Title 1|Action|
|1      |102   |4.0    |Movie Title 1|Action|
|2      |103   |3.0    |Movie Title 2|Sci Fi|
+-------+------+-------+-------------+------+
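To get from the in-memory example to the files in the question, something like the following might work. It is only a sketch: the helper names mergeRatingsWithMovies and saveAsSingleCsv are made up for illustration, and it assumes both files have no header row and exactly three columns each. It reads the files with spark.read.csv rather than split(","), which should also cope with titles that contain a comma inside quotes, and it puts the columns in the order asked for.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: reads both header-less CSV files, joins them on MovieId,
// and returns the columns in the order requested in the question.
// Assumption: each file has exactly three columns.
def mergeRatingsWithMovies(spark: SparkSession,
                           ratingsPath: String,
                           moviesPath: String): DataFrame = {
  val ratings = spark.read
    .option("inferSchema", "true") // infer numeric types instead of reading everything as String
    .csv(ratingsPath)
    .toDF("UserId", "MovieId", "ratings")

  val movies = spark.read
    .option("inferSchema", "true")
    .csv(moviesPath)
    .toDF("MovieId", "title", "genres")

  ratings
    .join(movies, Seq("MovieId"), "left_outer") // keep every rating, even if no movie row matches
    .select("UserId", "MovieId", "ratings", "title", "genres")
}

// Hypothetical helper: writes the result as one part file inside outputDir.
def saveAsSingleCsv(df: DataFrame, outputDir: String): Unit =
  df.coalesce(1).write.option("header", "true").csv(outputDir)
```

In spark-shell, where spark already exists, this could be called as mergeRatingsWithMovies(spark, "file:///usr/local/spark/dataset/rating.csv", "file:///usr/local/spark/dataset/movieupdate") and the result passed to saveAsSingleCsv together with an output directory. With the default save mode the write fails if that directory already exists, and the output is a directory holding a single part file, not a bare .csv file.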