I need to turn every category of a categorical variable into its own column, with each value expressed as a percentage of the row total. I have attached the input table and the output table produced after running the code; both are used as pyspark dataframes. My code gives the correct result, but I suspect there is a better way.
The code is written in pyspark. cust_data is the pyspark dataframe attached for reference.
Input_dataset
Output_view
from pyspark.sql.functions import col, round, sum

cust_data_brand = cust_data.groupBy('locationkey').pivot('brand').agg(round(sum("quantity"), 2)).orderBy('locationkey')
cust_data_brand = cust_data_brand.replace(float('nan'), None)
cust_data_brand = cust_data_brand.na.fill(0)
cols = cust_data_brand.columns[1:]
cust_data_brand_tot = cust_data_brand.withColumn('Brand_Tot', cust_data_brand.a+cust_data_brand.b+cust_data_brand.c)
for i in cols:
    cust_data_brand_tot = cust_data_brand_tot.withColumn(i + '_percent', round((cust_data_brand_tot[i] / cust_data_brand_tot['Brand_Tot']) * 100, 2))
cust_data_Type = cust_data.groupBy('locationkey').pivot('Type').agg(round(sum("quantity"),2)).orderBy('locationkey')
cust_data_Type = cust_data_Type.replace(float('nan'), None)
cust_data_Type = cust_data_Type.na.fill(0)
cols_Type = cust_data_Type.columns[1:]
cust_data_Type_tot = cust_data_Type.withColumn('Type_Tot', cust_data_Type.aa+cust_data_Type.bb+cust_data_Type.cc)
for i in cols_Type:
    cust_data_Type_tot = cust_data_Type_tot.withColumn(i + '_percent', round((cust_data_Type_tot[i] / cust_data_Type_tot['Type_Tot']) * 100, 2))
# loc_cnt must be defined before it is aliased
loc_cnt = spark.sql("select distinct locationkey from cust_data")
df0 = loc_cnt.alias('df0')
df1 = cust_data_brand_tot.alias('df1')
df2 = cust_data_Type_tot.alias('df2')
from pyspark.sql.functions import col
Final_cust_var_df1 = df0.join(df1, col('df0.locationkey') == col('df1.locationkey'), 'left').drop(col('df1.locationkey')) \
.join(df2, col('df0.locationkey') == col('df2.locationkey'), 'left').drop(col('df2.locationkey'))
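For comparison, the same pivot-plus-percentage step can be written in pandas without the per-column loop; a minimal sketch on a made-up toy frame (the values are invented, only the column names mirror the data above):

```python
import pandas as pd

# Toy stand-in for cust_data: invented values, same column names as above
df = pd.DataFrame({
    'locationkey': [1, 1, 1, 2, 2],
    'brand':       ['a', 'b', 'c', 'a', 'b'],
    'quantity':    [10, 30, 60, 50, 50],
})

# Pivot brand into columns, summing quantity; missing combinations become 0
pivot = df.pivot_table(index='locationkey', columns='brand',
                       values='quantity', aggfunc='sum', fill_value=0)

# Divide each row by its row total to get percentages in one step
percent = pivot.div(pivot.sum(axis=1), axis=0).mul(100).round(2)
```

The `div(..., axis=0)` broadcast replaces the explicit `Brand_Tot` column and the `for` loop over columns.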
Can anyone suggest a better way to do this in python or pyspark?