I have a DataFrame in PySpark, shown below:
df.show()
+-------+--------------------+--------------------+
| Dev_No| model| Tested|
+-------+--------------------+--------------------+
|BTA16C5| Windows PC| N|
|BTA16C5| SRL| N|
|BTA16C5| Hewlett Packard| N|
|CTA16C5| Android Devices| Y|
|CTA16C5| Hewlett Packard| N|
|4MY16A5| Other| N|
|4MY16A5| Other| N|
|4MY16A5| Tablet| Y|
|4MY16A5| Other| N|
|4MY16A5| Cable STB| Y|
|4MY16A5| Other| N|
|4MY16A5| Windows PC| Y|
|4MY16A5| Windows PC| Y|
|4MY16A5| Smart Watch| Y|
+-------+--------------------+--------------------+
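Using the DataFrame above, I want to create a new DataFrame with an additional column named Tested_devices. For each Dev_No, the column should hold the model values where Tested is Y, joined into a single comma-separated string.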
df1.show()
+-------+--------------------+--------------------+------------------------------------------------------+
| Dev_No| model| Tested| Tested_devices|
+-------+--------------------+--------------------+------------------------------------------------------+
|BTA16C5| Windows PC| N| |
|BTA16C5| SRL| N| |
|BTA16C5| Hewlett Packard| N| |
|CTA16C5| Android Devices| Y| Android Devices|
|CTA16C5| Hewlett Packard| N| |
|4MY16A5| Other| N| |
|4MY16A5| Other| N| |
|4MY16A5| Tablet| Y| Tablet, Cable STB,Windows PC, Windows PC, Smart Watch|
|4MY16A5| Other| N| |
|4MY16A5| Cable STB| Y| Tablet, Cable STB,Windows PC, Windows PC, Smart Watch|
|4MY16A5| Other| N| |
|4MY16A5| Windows PC| Y| Tablet, Cable STB,Windows PC, Windows PC, Smart Watch|
|4MY16A5| Windows PC| Y| Tablet, Cable STB,Windows PC, Windows PC, Smart Watch|
|4MY16A5| Smart Watch| Y| Tablet, Cable STB,Windows PC, Windows PC, Smart Watch|
+-------+--------------------+--------------------+------------------------------------------------------+
I tried the following to select Dev_No and model where Tested is Y:
a = df.select("Dev_No", "model"), when(df.Tested == 'Y')
This does not produce a result. It gives me the following error:
TypeError: when() takes exactly 2 arguments (1 given)
How can I achieve what I want?
Answer 0 (score: 1)
The comments in the code explain each step.
from pyspark.sql import Window
from pyspark.sql import functions as f
from pyspark.sql import types as t

# Window spec that groups rows by Dev_No
windowSpec = Window.partitionBy("Dev_No")

# UDF that joins the collected list into a string, but only for rows where Tested is "Y"
@f.udf(t.StringType())
def populatedUdfFunc(tested, models):
    if tested == "Y":
        return ", ".join(models)
    else:
        return ""

# Collect the models where Tested is "Y" over the window defined above;
# collect_list skips the nulls produced for the other rows
testedModels = f.collect_list(
    f.when(f.col("Tested") == "Y", f.col("model")).otherwise(None)
).over(windowSpec)
df.withColumn("Tested_devices", populatedUdfFunc(f.col("Tested"), testedModels)).show(truncate=False)
This should give you:
+-------+---------------+------+------------------------------------------------------+
|Dev_No |model |Tested|Tested_devices |
+-------+---------------+------+------------------------------------------------------+
|BTA16C5|Windows PC |N | |
|BTA16C5|SRL |N | |
|BTA16C5|Hewlett Packard|N | |
|4MY16A5|Other |N | |
|4MY16A5|Other |N | |
|4MY16A5|Tablet |Y |Tablet, Cable STB, Windows PC, Windows PC, Smart Watch|
|4MY16A5|Other |N | |
|4MY16A5|Cable STB |Y |Tablet, Cable STB, Windows PC, Windows PC, Smart Watch|
|4MY16A5|Other |N | |
|4MY16A5|Windows PC |Y |Tablet, Cable STB, Windows PC, Windows PC, Smart Watch|
|4MY16A5|Windows PC |Y |Tablet, Cable STB, Windows PC, Windows PC, Smart Watch|
|4MY16A5|Smart Watch |Y |Tablet, Cable STB, Windows PC, Windows PC, Smart Watch|
|CTA16C5|Android Devices|Y |Android Devices |
|CTA16C5|Hewlett Packard|N | |
+-------+---------------+------+------------------------------------------------------+
In PySpark 1.6, collect_list cannot be used with a window function, and collect_list is not defined for SQLContext. So you have to do without the window function and use HiveContext instead of SQLContext:
from pyspark.sql import functions as f
from pyspark.sql import types as t

# UDF that joins the collected list into a comma-separated string
def populatedUdfFunc(models):
    return ", ".join(models)

populateUdf = f.udf(populatedUdfFunc, t.StringType())

# Collect the models where Tested is "Y" per Dev_No, using groupBy instead of a window function
tempdf = df.groupBy("Dev_No").agg(
    populateUdf(
        f.collect_list(f.when(f.col("Tested") == "Y", f.col("model")).otherwise(None))
    ).alias("Tested_devices")
)

# Left join so that only rows with Tested == "Y" pick up the joined string
df.join(
    tempdf,
    (df["Dev_No"] == tempdf["Dev_No"]) & (df["Tested"] == f.lit("Y")),
    "left",
).show(truncate=False)
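You will get essentially the same output as above. Note that the join also keeps the Dev_No column from tempdf, and Tested_devices is null rather than empty on rows where Tested is N.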