Join two dataframes, sum values and get the maximum

Date: 2017-02-18 14:17:13

Tags: scala apache-spark

I have two .txt data files. The first one contains two columns (movie, cinema) and the second one contains two columns (movie, viewers), as in the example below. What I want to do is find the movie with the largest number of viewers among the movies shown in cinema_1.

+---------+----------+
| movie   | cinema   |
+---------+----------+
| movie_1 | cinema_2 |
| movie_2 | cinema_3 |
| movie_4 | cinema_1 |
| movie_3 | cinema_1 |
+---------+----------+

+---------+---------+
| movie   | viewers |
+---------+---------+
| movie_1 |      10 |
| movie_2 |      98 |
| movie_4 |     100 |
| movie_3 |      19 |
| movie_1 |     340 |
| movie_3 |      31 |
+---------+---------+

I.e. in the example above, the two candidates are movie_3 and movie_4 (both shown in cinema_1), and the correct answer is movie_4 with 100 viewers (while movie_3 has 50 (19 + 31) viewers).

What I have done so far:

Step 1: Read the data

    val moviesCinemas = sparkSession.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("mode", "DROPMALFORMED")
        .load("moviesCinemas.txt");

    val moviesViewers = sparkSession.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("mode", "DROPMALFORMED")
        .load("moviesViewers.txt");  

Step 2: Get the movies shown in cinema_1

    val cinema1Movies = moviesCinemas.filter(col("cinema").like("cinema_1"))
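
(Since the pattern contains no wildcard, a plain equality filter would be an equivalent sketch here:

    val cinema1Movies = moviesCinemas.filter(col("cinema") === "cinema_1")
)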

Resulting in:

+---------+----------+
| movie   | cinema   |
+---------+----------+
| movie_4 | cinema_1 |
| movie_3 | cinema_1 |
+---------+----------+

Step 3: Now, for these two movies, I have to sum up their viewers (from the moviesViewers dataframe) and report the one with the largest total. This is where I am actually stuck.

I tried joining the cinema1Movies and moviesViewers dataframes:

    val joinMoviesViewers = moviesViewers.join(cinema1Movies, Seq("movie"))

which gives the following result:

+---------+---------+
| movie   | viewers |
+---------+---------+
| movie_4 |     100 |
| movie_3 |      19 |
| movie_3 |      31 |
+---------+---------+

Now I am not quite sure how to sum up the viewers for each movie to get a result like this (and, in the end, the movie with the most viewers):

+---------+---------+
| movie   | viewers |
+---------+---------+
| movie_4 |     100 |
| movie_3 |      50 |
+---------+---------+

2 Answers:

Answer 0 (score: 1)

Here is the DataFrame API way to derive the result.

import org.apache.spark.sql.functions._
import sparkSession.implicits._ // needed for the $"..." column syntax

val result = moviesCinemas
  .filter($"cinema" === "cinema_1")
  .join(moviesViewers, "movie")
  .select(moviesCinemas("movie"), moviesViewers("viewers"))
  .groupBy($"movie")
  .agg(sum($"viewers").as("sum_cnt"))
  .orderBy($"sum_cnt".desc)

result.first
// res34: org.apache.spark.sql.Row = [movie_4,100]
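
To get just the movie name instead of the whole Row, something along these lines should work on the same result dataframe (a small sketch):

result.select("movie").first.getString(0)
// String = movie_4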

The following uses Spark SQL to get the same result.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local") // set your master here
  .appName("spark session example")
  .getOrCreate()

moviesCinemas.registerTempTable("movies_cinemas")
moviesViewers.registerTempTable("movies_viewers")

val result = spark.sql(
  """
  SELECT
    t0.movie,
    SUM(t1.viewers) AS total_viewers
  FROM movies_cinemas t0
  JOIN movies_viewers t1
    ON t0.movie = t1.movie
  WHERE t0.cinema = 'cinema_1'
  GROUP BY t0.movie
  ORDER BY total_viewers DESC
  """
)

result.first

res6: org.apache.spark.sql.Row = [movie_4,100]
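
Note that registerTempTable is deprecated as of Spark 2.0; createOrReplaceTempView is the drop-in replacement:

moviesCinemas.createOrReplaceTempView("movies_cinemas")
moviesViewers.createOrReplaceTempView("movies_viewers")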

Answer 1 (score: 1)

Starting from your joined dataframe:

val aggJoin = joinMoviesViewers.groupBy("movie").agg(sum($"viewers").as("viewers"))
// aggJoin: org.apache.spark.sql.DataFrame = [movie: string, viewers: bigint]

val maxViewers = aggJoin.agg(max($"viewers")).first().getLong(0)
// maxViewers: Long = 100

// depending on what data type you have for viewers, you might use getDouble here
// val maxViewers = aggJoin.agg(max($"viewers")).first().getDouble(0)

aggJoin.filter($"viewers" === maxViewers).show
+-------+-------+
|  movie|viewers|
+-------+-------+
|movie_4|    100|
+-------+-------+
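
If only the single top movie is needed, the separate max computation can be skipped by sorting the aggregated dataframe instead (a sketch on the same aggJoin; unlike the filter above, it returns just one row even when several movies tie for the maximum):

// order by the summed viewers and take the top row
aggJoin.orderBy($"viewers".desc).first
// expected for the sample data: [movie_4,100]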