I need to turn every category of a categorical variable into its own column, with each value expressed as a percentage of the row total. I have attached the input table and the output table produced after running the code; both are used as pyspark dataframes. My code gives the correct result, but I suspect there is a better way.
The code is written in pyspark. cust_data is the pyspark dataframe attached for reference.
Input_dataset
Output_view
from pyspark.sql.functions import col, round, sum

cust_data_brand = cust_data.groupBy('locationkey').pivot('brand').agg(round(sum("quantity"), 2)).orderBy('locationkey')
cust_data_brand = cust_data_brand.replace(float('nan'), None)
cust_data_brand = cust_data_brand.na.fill(0)
cols = cust_data_brand.columns[1:]
cust_data_brand_tot = cust_data_brand.withColumn('Brand_Tot', cust_data_brand.a+cust_data_brand.b+cust_data_brand.c)
for i in cols:
    cust_data_brand_tot = cust_data_brand_tot.withColumn(i + '_percent', round((cust_data_brand_tot[i] / cust_data_brand_tot['Brand_Tot']) * 100, 2))
cust_data_Type = cust_data.groupBy('locationkey').pivot('Type').agg(round(sum("quantity"),2)).orderBy('locationkey')
cust_data_Type = cust_data_Type.replace(float('nan'), None)
cust_data_Type = cust_data_Type.na.fill(0)
cols_Type = cust_data_Type.columns[1:]
cust_data_Type_tot = cust_data_Type.withColumn('Type_Tot', cust_data_Type.aa+cust_data_Type.bb+cust_data_Type.cc)
for i in cols_Type:
    cust_data_Type_tot = cust_data_Type_tot.withColumn(i + '_percent', round((cust_data_Type_tot[i] / cust_data_Type_tot['Type_Tot']) * 100, 2))
# loc_cnt must be defined before it is aliased
loc_cnt = spark.sql("select distinct locationkey from cust_data")
df0 = loc_cnt.alias('df0')
df1 = cust_data_brand_tot.alias('df1')
df2 = cust_data_Type_tot.alias('df2')
from pyspark.sql.functions import col
Final_cust_var_df1 = df0.join(df1, col('df0.locationkey') == col('df1.locationkey'), 'left').drop(col('df1.locationkey')) \
.join(df2, col('df0.locationkey') == col('df2.locationkey'), 'left').drop(col('df2.locationkey'))
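For comparison, the same pivot-plus-percentage step can be written in pandas without the per-column loop; a minimal sketch on a made-up toy frame (the values are invented, only the column names mirror the data above):

```python
import pandas as pd

# Toy stand-in for cust_data: invented values, same column names as above
df = pd.DataFrame({
    'locationkey': [1, 1, 1, 2, 2],
    'brand':       ['a', 'b', 'c', 'a', 'b'],
    'quantity':    [10, 30, 60, 50, 50],
})

# Pivot brand into columns, summing quantity; missing combinations become 0
pivot = df.pivot_table(index='locationkey', columns='brand',
                       values='quantity', aggfunc='sum', fill_value=0)

# Divide each row by its row total to get percentages in one step
percent = pivot.div(pivot.sum(axis=1), axis=0).mul(100).round(2)
```

The `div(..., axis=0)` broadcast replaces the explicit `Brand_Tot` column and the `for` loop over columns.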
Can anyone suggest a better way to do this in python or pyspark?