我的df包含产品名称和相应的信息。相关的是出售给的名称和国家/地区:
+--------------------+-------------------------+
| Product_name|collect_set(Countries_en)|
+--------------------+-------------------------+
| null| [Belgium,United K...|
| #5 pecan/almond| [Belgium]|
| #8 mango/strawberry| [Belgium]|
|& Sully A Mild Th...| [Belgium,France]|
|"70CL Liqueu...| [Belgium,France]|
|"Gingembre&q...| [Belgium]|
|"Les Schtrou...| [Belgium,France]|
|"Sho-key&quo...| [Belgium]|
|"mini Chupa ...| [Belgium,France]|
| 'S Lands beste| [Belgium]|
|'T vlierbos confi...| [Belgium]|
|(H)eat me - Spagh...| [Belgium]|
| -cheese flips| [Belgium]|
| .soupe cerfeuil| [Belgium]|
|1 1/2 Minutes Bas...| [Belgium,Luxembourg]|
| 1/2 Reblochon AOP| [Belgium]|
| 1/2 nous de jambon| [Belgium]|
|1/2 tarte cerise ...| [Belgium]|
|10 Original Knack...| [Belgium,France,S...|
| 10 pains au lait| [Belgium,France]|
+--------------------+-------------------------+
样本输入数据:
[Row(code=2038002038.0, Product_name='Formula 2 men multi vitaminic', Countries_en='France,Ireland,Italy,Mexico,United States,Argentina-espanol,Armenia-pyсский,Aruba-espanol,Asia-pacific,Australia-english,Austria-deutsch,Azerbaijan-русский,Belarus-pyсский,Belgium-francais,Belgium-nederlands,Bolivia-espanol,Bosnia-i-hercegovina-bosnian,Botswana-english,Brazil-portugues,Bulgaria-български,Cambodia-english,Cambodia-ភាសាខ្មែរ,Canada-english,Canada-francais,Chile-espanol,China-中文,Colombia-espanol,Costa-rica-espanol,Croatia-hrvatski,Cyprus-ελληνικά,Czech-republic-čeština,Denmark-dansk,Ecuador-espanol,El-salvador-espanol,Estonia-eesti,Europe,Finland-suomi,France-francais,Georgia-ქართული,Germany-deutsch,Ghana-english,Greece-ελληνικά,Guatemala-espanol,Honduras-espanol,Hong-kong-粵語,Hungary-magyar,Iceland-islenska,India-english,Indonesia-bahasa-indonesia,Ireland-english,Israel-עברית,Italy-italiano,Jamaica-english,Japan-日本語,Kazakhstan-pyсский,Korea-한국어,Kyrgyzstan-русский,Latvia-latviešu,Lebanon-english,Lesotho-english,Lithuania-lietuvių,Macau-中文,Malaysia-bahasa-melayu,Malaysia-english,Malaysia-中文,Mexico-espanol,Middle-east-africa,Moldova-roman,Mongolia-монгол-хэл,Namibia-english,Netherlands-nederlands,New-zealand-english,Nicaragua-espanol,North-macedonia-македонски-јазик,Norway-norsk,Panama-espanol,Paraguay-espanol,Peru-espanol,Philippines-english,Poland-polski,Portugal-portugues,Puerto-rico-espanol,Republica-dominicana-espanol,Romania-romană,Russia-русский,Serbia-srpski,Singapore-english,Slovak-republic-slovenčina,Slovenia-slovene,South-africa-english,Spain-espanol,Swaziland-english,Sweden-svenska,Switzerland-deutsch,Switzerland-francais,Taiwan-中文,Thailand-ไทย,Trinidad-tobago-english,Turkey-turkce,Ukraine-yкраї́нська,United-kingdom-english,United-states-english,United-states-espanol,Uruguay-espanol,Venezuela-espanol,Vietnam-tiếng-việt,Zambia-english', Traces_en=None, Additives_tags=None, Main_category_en='Vitamins', Image_url='https://static.openfoodfacts.org/images/products/203/800/203/8/front_en.12.400.jpg', Quantity='60 compresse', Packaging_tags='barattolo,tablet', )]
由于我想探索产品销售至比利时以外的国家/地区,因此我将“国家/地区”列拆分为使用以下代码分别显示每个国家/地区
#create df with grouped products
countriesDF = productsDF\
.select("Product_name", "Countries_en")\
.groupBy("Product_name")\
.agg(F.collect_set("Countries_en").cast("string").alias("Countries"))\
.orderBy("Product_name")
#split df to show countries the product is sold to in a seperate column
countriesDF = countriesDF\
.where(col("Countries")!="null")\
.select("Product_name",\
F.split("Countries", ",").alias("Countries"),
F.posexplode(F.split("Countries", ",")).alias("pos", "val")
)\
.drop("val")\
.select(
"Product_name",
F.concat(F.lit("Countries"),F.col("pos").cast("string")).alias("name"),
F.expr("Countries[pos]").alias("val")
)\
.groupBy("Product_name").pivot("name").agg(F.first("val"))\
.show()
但是,仅对于无法显示的国家/地区,此表现在有超过400列。所以我的问题是:
答案 0 :(得分:1)
一些样本数据:
val sampledf = Seq(("p1","BELGIUM,GERMANY"),("p1","BELGIUM,ITALY"),("p1","GERMANY"),("p2","BELGIUM")).toDF("Product_name","Countries_en")
转换为必需的df:
df = sampledf
.withColumn("country_list",split(col("Countries_en"),","))
.select(col("Product_name"), explode(col("country_list")).as("country"))
+------------+-------+
|Product_name|country|
+------------+-------+
| p1|BELGIUM|
| p1|GERMANY|
| p1|BELGIUM|
| p1| ITALY|
| p1|GERMANY|
| p2|BELGIUM|
+------------+-------+
如果您只需要每个国家/地区的计数:
countDF = df.groupBy("Product_name","country").count()
countDF.show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|BELGIUM| 2|
| p1|GERMANY| 1|
| p2|BELGIUM| 1|
+------------+-------+-----+
除了比利时:
countDF.filter(col("country") =!="BELGIUM").show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|GERMANY| 1|
+------------+-------+-----+
如果您确实希望将国家/地区作为列:
countDF.groupBy("Product_name").pivot("country").agg(first("count"))
+------------+-------+-------+
|Product_name|BELGIUM|GERMANY|
+------------+-------+-------+
| p2| 1| null|
| p1| 2| 1|
+------------+-------+-------+
您可以.drop("BELGIUM")
来实现。
答案 1 :(得分:0)
使用的最终代码:
str_shuffle()