爆炸后按多列过滤数据框

时间:2020-10-21 19:27:57

标签: apache-spark pyspark apache-spark-sql

我的df包含产品名称和相应的信息。相关的是出售给的名称和国家/地区:

+--------------------+-------------------------+
|        Product_name|collect_set(Countries_en)|
+--------------------+-------------------------+
|                null|     [Belgium,United K...|
|     #5 pecan/almond|                [Belgium]|
| #8 mango/strawberry|                [Belgium]|
|& Sully A Mild Th...|         [Belgium,France]|
|"70CL Liqueu...|         [Belgium,France]|
|"Gingembre&q...|                [Belgium]|
|"Les Schtrou...|         [Belgium,France]|
|"Sho-key&quo...|                [Belgium]|
|"mini Chupa ...|         [Belgium,France]|
|      'S Lands beste|                [Belgium]|
|'T vlierbos confi...|                [Belgium]|
|(H)eat me - Spagh...|                [Belgium]|
|       -cheese flips|                [Belgium]|
|     .soupe cerfeuil|                [Belgium]|
|1 1/2 Minutes Bas...|     [Belgium,Luxembourg]|
|   1/2 Reblochon AOP|                [Belgium]|
|  1/2 nous de jambon|                [Belgium]|
|1/2 tarte cerise ...|                [Belgium]|
|10 Original Knack...|     [Belgium,France,S...|
|    10 pains au lait|         [Belgium,France]|
+--------------------+-------------------------+

样本输入数据:

[Row(code=2038002038.0, Product_name='Formula 2 men multi vitaminic', Countries_en='France,Ireland,Italy,Mexico,United States,Argentina-espanol,Armenia-pyсский,Aruba-espanol,Asia-pacific,Australia-english,Austria-deutsch,Azerbaijan-русский,Belarus-pyсский,Belgium-francais,Belgium-nederlands,Bolivia-espanol,Bosnia-i-hercegovina-bosnian,Botswana-english,Brazil-portugues,Bulgaria-български,Cambodia-english,Cambodia-ភាសាខ្មែរ,Canada-english,Canada-francais,Chile-espanol,China-中文,Colombia-espanol,Costa-rica-espanol,Croatia-hrvatski,Cyprus-ελληνικά,Czech-republic-čeština,Denmark-dansk,Ecuador-espanol,El-salvador-espanol,Estonia-eesti,Europe,Finland-suomi,France-francais,Georgia-ქართული,Germany-deutsch,Ghana-english,Greece-ελληνικά,Guatemala-espanol,Honduras-espanol,Hong-kong-粵語,Hungary-magyar,Iceland-islenska,India-english,Indonesia-bahasa-indonesia,Ireland-english,Israel-עברית,Italy-italiano,Jamaica-english,Japan-日本語,Kazakhstan-pyсский,Korea-한국어,Kyrgyzstan-русский,Latvia-latviešu,Lebanon-english,Lesotho-english,Lithuania-lietuvių,Macau-中文,Malaysia-bahasa-melayu,Malaysia-english,Malaysia-中文,Mexico-espanol,Middle-east-africa,Moldova-roman,Mongolia-монгол-хэл,Namibia-english,Netherlands-nederlands,New-zealand-english,Nicaragua-espanol,North-macedonia-македонски-јазик,Norway-norsk,Panama-espanol,Paraguay-espanol,Peru-espanol,Philippines-english,Poland-polski,Portugal-portugues,Puerto-rico-espanol,Republica-dominicana-espanol,Romania-romană,Russia-русский,Serbia-srpski,Singapore-english,Slovak-republic-slovenčina,Slovenia-slovene,South-africa-english,Spain-espanol,Swaziland-english,Sweden-svenska,Switzerland-deutsch,Switzerland-francais,Taiwan-中文,Thailand-ไทย,Trinidad-tobago-english,Turkey-turkce,Ukraine-yкраї́нська,United-kingdom-english,United-states-english,United-states-espanol,Uruguay-espanol,Venezuela-espanol,Vietnam-tiếng-việt,Zambia-english', Traces_en=None, Additives_tags=None, Main_category_en='Vitamins', Image_url='https://static.openfoodfacts.org/images/products/203/800/203/8/front_en.12.400.jpg', Quantity='60 compresse', Packaging_tags='barattolo,tablet', )]

由于我想探索产品销售至比利时以外的国家/地区,因此我将“国家/地区”列拆分为使用以下代码分别显示每个国家/地区

#create df with grouped products
countriesDF = productsDF\
.select("Product_name", "Countries_en")\
.groupBy("Product_name")\
.agg(F.collect_set("Countries_en").cast("string").alias("Countries"))\
.orderBy("Product_name")

#split df to show countries the product is sold to in a seperate column
countriesDF = countriesDF\
   .where(col("Countries")!="null")\
.select("Product_name",\
        F.split("Countries", ",").alias("Countries"),
        F.posexplode(F.split("Countries", ",")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "Product_name",
        F.concat(F.lit("Countries"),F.col("pos").cast("string")).alias("name"),
        F.expr("Countries[pos]").alias("val")
    )\
    .groupBy("Product_name").pivot("name").agg(F.first("val"))\
.show()

但是,仅对于无法显示的国家/地区,此表现在有超过400列。所以我的问题是:

  1. 我是否正确地进行了分裂/爆炸?
  2. 我可以拆分df,以便以列名称(例如“ France”而不是“ countries1”等)来获得国家/地区,以计算产品在该国家/地区的销售次数吗?

2 个答案:

答案 0 :(得分:1)

一些样本数据:

val sampledf = Seq(("p1","BELGIUM,GERMANY"),("p1","BELGIUM,ITALY"),("p1","GERMANY"),("p2","BELGIUM")).toDF("Product_name","Countries_en")

转换为必需的df:

df = sampledf
.withColumn("country_list",split(col("Countries_en"),","))
.select(col("Product_name"), explode(col("country_list")).as("country"))

+------------+-------+
|Product_name|country|
+------------+-------+
|          p1|BELGIUM|
|          p1|GERMANY|
|          p1|BELGIUM|
|          p1|  ITALY|
|          p1|GERMANY|
|          p2|BELGIUM|
+------------+-------+

如果您只需要每个国家/地区的计数:

countDF = df.groupBy("Product_name","country").count()

countDF.show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
|          p1|BELGIUM|    2|
|          p1|GERMANY|    1|
|          p2|BELGIUM|    1|
+------------+-------+-----+

除了比利时:

countDF.filter(col("country") =!="BELGIUM").show()

+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
|          p1|GERMANY|    1|
+------------+-------+-----+

如果您确实希望将国家/地区作为列:

countDF.groupBy("Product_name").pivot("country").agg(first("count"))

+------------+-------+-------+
|Product_name|BELGIUM|GERMANY|
+------------+-------+-------+
|          p2|      1|   null|
|          p1|      2|      1|
+------------+-------+-------+

您可以.drop("BELGIUM")来实现。

答案 1 :(得分:0)

使用的最终代码:

str_shuffle()