How do I count the number of records matching an exploded column in pyspark.sql?

Date: 2019-06-21 20:05:27

Tags: pyspark pyspark-sql

I have an assignment that uses Spark 2.4 and part of the Yelp dataset. The portion of the business-data schema used below is already loaded into a single DataFrame:

"business_id": string
"categories": comma delimited list of strings
"stars": double

We are supposed to create a new DataFrame, grouped by category, containing the following columns:

"category": string exploded from "categories"
"businessCount": integer; number of businesses in that category
"averageStarRating": double; average rating of businesses in the category
"minStarRating": double; lowest rating of any restaurant in that category
"maxStarRating": double; highest rating of any restaurant in that category

So far, I have figured out how to use the explode command to break the "categories" column into separate records and display "business_id", "category", and "stars":

from pyspark.sql import functions as F
businessdf.select("business_id", F.explode(F.split("categories", ",")).alias("category"), "stars").show(5)

The command above gives me this result:

+--------------------+--------------+-----+
|         business_id|      category|stars|
+--------------------+--------------+-----+
|1SWheh84yJXfytovI...|          Golf|  3.0|
|1SWheh84yJXfytovI...|   Active Life|  3.0|
|QXAEGFB4oINsVuTFx...|Specialty Food|  2.5|
|QXAEGFB4oINsVuTFx...|   Restaurants|  2.5|
|QXAEGFB4oINsVuTFx...|       Dim Sum|  2.5|
+--------------------+--------------+-----+
only showing top 5 rows

What I can't figure out is how to use aggregate functions to create the other columns. My professor says it all has to be done in a single statement. All of my attempts so far have resulted in errors.

My assignment also says that before doing any aggregation I need to strip all leading/trailing whitespace from the newly created "category" column, but all of my attempts at that have also resulted in errors.

I feel like this is the closest I've come, but I don't know what to do next:

businessdf.select(F.explode(F.split("categories", ",")).alias("category")) \
    .groupBy("category") \
    .agg(F.count("category").alias("businessCount"),
         F.avg("stars").alias("averageStarRating"),
         F.min("stars").alias("minStarRating"),
         F.max("stars").alias("maxStarRating"))

Here is the error that command produces:

`pyspark.sql.utils.AnalysisException: "cannot resolve '`stars`' given input columns: [category];;\n'Aggregate [category#337], [category#337, count(category#337) AS businessCount#342L, avg('stars) AS averageStarRating#344, min('stars) AS minStarRating#346, max('stars) AS maxStarRating#348]\n+- Project [category#337]\n   +- Generate explode(split(categories#33, ,)), false, [category#337]\n      +- Relation[address#30,attributes#31,business_id#32,categories#33,city#34,hours#35,is_open#36L,latitude#37,lo`ngitude#38,name#39,postal_code#40,review_count#41L,stars#42,state#43] json\n"

1 answer:

Answer 0 (score: 0)

Never mind, the act of posting must have helped me work it out myself. The command I posted above was very close, but I had forgotten to add the "stars" column to the select statement. The correct command is here:

businessdf.select(F.explode(F.split("categories", ",")).alias("category"), "stars") \
    .groupBy("category") \
    .agg(F.count("category").alias("businessCount"),
         F.avg("stars").alias("averageStarRating"),
         F.min("stars").alias("minStarRating"),
         F.max("stars").alias("maxStarRating")) \
    .show()