Error when saving a RelationalGroupedDataset to HDFS

Asked: 2017-12-27 11:39:58

Tags: apache-spark

I'm new to Spark and trying to write grouped data to a txt file, but I get the following error:

Error:(55, 31) value write is not a member of org.apache.spark.sql.RelationalGroupedDataset

The code snippet is:

val dfyearlyGamesSelect = dfFiltered.select($"release_year", $"title")
val dfyearlyGroup = dfyearlyGamesSelect.groupBy($"release_year")
// compile error on the next line: groupBy returns a RelationalGroupedDataset, which has no write member
val dfWrite = dfyearlyGroup.write
                           .format("com.databricks.spark.csv")
                           .option("header","true")
                           .save(outputPath)

Expected output: for each year, the title of the highest-scoring game(s). (Columns: release_year, title, score)

Sample data:

,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero Edition,/games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11
5,Good,Total War Battles: Shogun,/games/total-war-battles-shogun/mac-142565,Macintosh,7.0,Strategy,N,2012,9,11
6,Awful,Double Dragon: Neon,/games/double-dragon-neon/xbox-360-131320,Xbox 360,3.0,Fighting,N,2012,9,11
7,Amazing,Guild Wars 2,/games/guild-wars-2/pc-896298,PC,9.0,RPG,Y,2012,9,11

1 Answer:

Answer 0 (score: 0)

write is not available on a RelationalGroupedDataset. You have to apply an aggregate function to get back a DataFrame, which you can then write to HDFS.

import org.apache.spark.sql.functions.first

val dfYearlyGroup = dfyearlyGamesSelect.groupBy($"release_year")
                                       .agg(first($"title") as "title")

Now dfYearlyGroup is a DataFrame, and you can write it to HDFS. Also, you don't need to assign the write call to a variable (dfWrite above), since save doesn't return anything.

dfYearlyGroup.write
             .format("com.databricks.spark.csv")
             .option("header","true")
             .save(outputPath)
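
As a side note: on Spark 2.0+, CSV support is built in, so the com.databricks.spark.csv package is not needed. A minimal sketch, assuming Spark 2.x:

// Spark 2.x ships a native CSV data source; no external package required
dfYearlyGroup.write
             .option("header", "true")
             .csv(outputPath)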

Edit:

For your use case, you can use the window functions rank or row_number, depending on whether you want to keep multiple rows when scores tie (a row_number variant is sketched after the example below).

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, lit}

df.select($"release_year", $"title", $"score").show(false)
+------------+----------------------------------------------------+-----+
|release_year|title                                               |score|
+------------+----------------------------------------------------+-----+
|2012        |LittleBigPlanet PS Vita                             |9.0  |
|2012        |LittleBigPlanet PS Vita -- Marvel Super Hero Edition|9.0  |
|2012        |Splice: Tree of Life                                |8.5  |
|2012        |NHL 13                                              |8.5  |
|2012        |NHL 13                                              |8.5  |
|2012        |Total War Battles: Shogun                           |7.0  |
|2012        |Double Dragon: Neon                                 |3.0  |
|2012        |Guild Wars 2                                        |9.0  |
+------------+----------------------------------------------------+-----+


val w = Window.partitionBy($"release_year").orderBy($"score".desc)

val dfYearlyMaxScore = df.withColumn("rank", rank().over(w))
                         .where($"rank" === lit(1))
                         .select($"release_year", $"title", $"score")

dfYearlyMaxScore.show(false)

+------------+----------------------------------------------------+-----+
|release_year|title                                               |score|
+------------+----------------------------------------------------+-----+
|2012        |LittleBigPlanet PS Vita                             |9.0  |
|2012        |LittleBigPlanet PS Vita -- Marvel Super Hero Edition|9.0  |
|2012        |Guild Wars 2                                        |9.0  |
+------------+----------------------------------------------------+-----+
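
If you want exactly one row per year even when scores tie, swap rank for row_number; a sketch using the same window w (the tie is then broken arbitrarily unless you extend the ordering with a tiebreaker column):

import org.apache.spark.sql.functions.row_number

// row_number assigns 1, 2, 3, ... within each year, so filtering on 1 keeps a single row
val dfYearlyTop1 = df.withColumn("rn", row_number().over(w))
                     .where($"rn" === 1)
                     .select($"release_year", $"title", $"score")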

Now you can write it out with:

dfYearlyMaxScore.write
                .format("com.databricks.spark.csv")
                .option("header","true")
                .save(outputPath)
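
As a quick sanity check, you can read the result back; this sketch assumes Spark 2.x's built-in CSV reader and an active spark session:

// read the written CSV back and inspect it
val dfCheck = spark.read
                   .option("header", "true")
                   .csv(outputPath)
dfCheck.show(false)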