I have a problem: finding the highest-rated tags for books. My data looks like this:
Books
+---+--------------------+-----------+
| id| tags| title|
+---+--------------------+-----------+
| 1| PHP, JAVA, XML, SQL| First Book|
| 2| PHP, Javascript|Second Book|
| 3|PHP, Javascript, ...| Third Book|
| 4|PHP, RUBY, YAML, SQL|Fourth Book|
+---+--------------------+-----------+
Ratings
+-------+------+----+
|book_id|rating|user|
+-------+------+----+
| 1| 4| 15|
| 3| 2| 15|
| 1| 5| 17|
| 2| 3| 21|
+-------+------+----+
The solution I came up with uses Spark SQL, since I'm more familiar with SQL:
val spark = SparkSession.builder().appName("Books").getOrCreate()
import spark.implicits._  // needed for the $"colName" syntax below
val tags = books
  .withColumn("tags_splitted", split($"tags", ","))
  .withColumn("tag_exploded", explode($"tags_splitted"))
  .select("id", "tag_exploded")
tags.createOrReplaceTempView("tags")
ratings.createOrReplaceTempView("ratings")
val result = spark.sql("SELECT tag_exploded, count(*) AS cnt FROM tags JOIN ratings ON tags.id = ratings.book_id GROUP BY tag_exploded ORDER BY cnt DESC")
While this gives me the result, I'd like to know whether there is a better way to solve this with map/reduce on Spark. The approach above doesn't seem to take advantage of my parallel/distributed processing.
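For comparison, a map/reduce-style version of the same computation can be expressed directly on the RDD API: flatMap each book into (id, tag) pairs, join with ratings keyed by book_id, then reduceByKey per tag. This is only a sketch assuming the `books` and `ratings` DataFrames shown above (with the same column names), and it aggregates an average rating per tag rather than a rating count:

```scala
// Sketch of an RDD-based map/reduce approach (assumes the `books` and
// `ratings` DataFrames from above, with an active SparkSession).

// (book id, tag) pairs; trim because the tags are comma-space separated.
val tagPairs = books.rdd.flatMap { row =>
  val id = row.getAs[Int]("id")
  row.getAs[String]("tags").split(",").map(t => (id, t.trim))
}

// (book id, rating) pairs.
val ratingPairs = ratings.rdd.map { row =>
  (row.getAs[Int]("book_id"), row.getAs[Int]("rating"))
}

// Join on book id, then reduce per tag to (sum, count) and compute the mean.
val tagAvgRatings = tagPairs
  .join(ratingPairs)                                        // (id, (tag, rating))
  .map { case (_, (tag, rating)) => (tag, (rating.toDouble, 1L)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
  .sortBy(_._2, ascending = false)
```

Note that Spark SQL queries like the one above are themselves compiled to distributed jobs, so the SQL version is not inherently less parallel; the RDD form just makes the map/reduce stages explicit.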