I have two .txt data files. The first one contains two columns (movie, cinema) and the second one contains two columns (movie, viewers), as in the example below. What I want to do is find the movie with the largest number of viewers among the movies shown in cinema_1.
+---------+----------+
| movie   | cinema   |
+---------+----------+
| movie_1 | cinema_2 |
| movie_2 | cinema_3 |
| movie_4 | cinema_1 |
| movie_3 | cinema_1 |
+---------+----------+
+---------+---------+
| movie   | viewers |
+---------+---------+
| movie_1 |      10 |
| movie_2 |      98 |
| movie_4 |     100 |
| movie_3 |      19 |
| movie_1 |     340 |
| movie_3 |      31 |
+---------+---------+
I.e., in the example above the two candidates are movie_3 and movie_4 (both shown in cinema_1), and the correct answer is movie_4 with 100 viewers, while movie_3 has only 50 (19 + 31) viewers.
What I have done so far:
Step 1: load the data
val moviesCinemas = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("moviesCinemas.txt")

val moviesViewers = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("moviesViewers.txt")
Step 2: get the movies shown in cinema_1
import org.apache.spark.sql.functions.col

val cinema1Movies = moviesCinemas.filter(col("cinema").like("cinema_1"))
which results in:
+---------+----------+
| movie   | cinema   |
+---------+----------+
| movie_4 | cinema_1 |
| movie_3 | cinema_1 |
+---------+----------+
Step 3: now, for these two movies, I have to sum up their viewers (from the moviesViewers dataframe) and report the one with the largest total. This is where I am actually stuck.
I tried joining the cinema1Movies and moviesViewers dataframes:
val joinMoviesViewers = moviesViewers.join(cinema1Movies, Seq("movie"))
which gives the following result:
+---------+---------+
| movie   | viewers |
+---------+---------+
| movie_4 |     100 |
| movie_3 |      19 |
| movie_3 |      31 |
+---------+---------+
Now I am not quite sure how to sum the viewers for each movie to get a result like this (and, from there, the movie with the most viewers):
+---------+---------+
| movie   | viewers |
+---------+---------+
| movie_4 |     100 |
| movie_3 |      50 |
+---------+---------+
Answer 0 (score: 1)
Here is a DataFrame API approach to derive the result.
import org.apache.spark.sql.functions._
import sparkSession.implicits._ // needed for the $"..." column syntax

val result = moviesCinemas
  .filter($"cinema" === "cinema_1")
  .join(moviesViewers, "movie")
  .select(moviesCinemas("movie"), moviesViewers("viewers"))
  .groupBy($"movie")
  .agg(sum($"viewers").as("sum_cnt"))
  .orderBy($"sum_cnt".desc)

result.first
// res34: org.apache.spark.sql.Row = [movie_4,100]
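If only the winning movie's name is needed rather than the whole Row, it can be pulled out with getAs (a small usage sketch, not part of the output above):

// extract just the movie name from the top row
val topMovie = result.first.getAs[String]("movie")
// topMovie: String = movie_4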
The following uses Spark SQL to obtain the same result.
moviesCinemas.createOrReplaceTempView("movies_cinemas")
moviesViewers.createOrReplaceTempView("movies_viewers")

// reuse the session the dataframes were read with
val result = sparkSession.sql(
  """
    SELECT
      t0.movie,
      sum(viewers) AS total_viewers
    FROM movies_cinemas t0
    JOIN movies_viewers t1
      ON t0.movie = t1.movie
    WHERE t0.cinema = 'cinema_1'
    GROUP BY t0.movie
    ORDER BY total_viewers DESC
  """
)
result.first
// res6: org.apache.spark.sql.Row = [movie_4,100]
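If only the top movie is wanted, a LIMIT 1 can be appended to the same query so that only the winning row comes back (a sketch along the same lines, not part of the original answer):

// same query as above, with LIMIT 1 appended after the ORDER BY
val top = sparkSession.sql(
  """
    SELECT t0.movie, sum(viewers) AS total_viewers
    FROM movies_cinemas t0
    JOIN movies_viewers t1 ON t0.movie = t1.movie
    WHERE t0.cinema = 'cinema_1'
    GROUP BY t0.movie
    ORDER BY total_viewers DESC
    LIMIT 1
  """
)
// top.first should again be [movie_4,100] on the example data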
Answer 1 (score: 1)
Starting from the already-joined dataframe:
import org.apache.spark.sql.functions.{max, sum}
import sparkSession.implicits._ // for the $"..." column syntax

val aggJoin = joinMoviesViewers.groupBy("movie").agg(sum($"viewers").as("viewers"))
// aggJoin: org.apache.spark.sql.DataFrame = [movie: string, viewers: bigint]

val maxViewers = aggJoin.agg(max($"viewers")).first().getLong(0)
// maxViewers: Long = 100
// depending on what data type you have for viewers, you might use getDouble here
// val maxViewers = aggJoin.agg(max($"viewers")).first().getDouble(0)

aggJoin.filter($"viewers" === maxViewers).show
+-------+-------+
| movie|viewers|
+-------+-------+
|movie_4| 100|
+-------+-------+
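A note on this approach: computing maxViewers and then filtering scans the aggregated data twice, but it keeps all movies that tie for the maximum. If a single winner is enough, one pass suffices (a sketch, not part of the original answer):

// one-pass alternative: sort by the aggregated viewers and keep the top row
// (returns only one row even if several movies tie for the maximum)
aggJoin.orderBy($"viewers".desc).limit(1).show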