Performing functions on multiple columns of a PySpark dataframe

Time: 2020-07-13 14:36:55

Tags: apache-spark pyspark apache-spark-sql pyspark-dataframes

I have to apply certain functions to multiple columns of a PySpark dataframe. Below is my code:

from pyspark.sql.functions import regexp_replace

# Keep only digits in the numeric columns, and only letters
# (plus spaces where needed) in the text columns
finaldf = df.withColumn('phone_number', regexp_replace("phone_number", "[^0-9]", ""))\
    .withColumn('account_id', regexp_replace("account_id", "[^0-9]", ""))\
    .withColumn('credit_card_limit', regexp_replace("credit_card_limit", "[^0-9]", ""))\
    .withColumn('credit_card_number', regexp_replace("credit_card_number", "[^0-9]", ""))\
    .withColumn('full_name', regexp_replace("full_name", "[^a-zA-Z ]", ""))\
    .withColumn('transaction_code', regexp_replace("transaction_code", "[^a-zA-Z]", ""))\
    .withColumn('shop', regexp_replace("shop", "[^a-zA-Z ]", ""))

# Drop rows where any of the key columns is null
finaldf = finaldf.filter(finaldf.account_id.isNotNull())\
    .filter(finaldf.phone_number.isNotNull())\
    .filter(finaldf.credit_card_number.isNotNull())\
    .filter(finaldf.credit_card_limit.isNotNull())\
    .filter(finaldf.transaction_code.isNotNull())\
    .filter(finaldf.amount.isNotNull())

As you can see, the code is repetitive, which also makes the program longer. I have also read that Spark UDFs are not efficient.

Is there a way to optimize this code? Please let me know. Thanks a lot!

1 Answer:

Answer 0 (score: 1):

For the multiple filters, you should do it like this:

filter_cols = ['account_id', 'phone_number', 'credit_card_number',
               'credit_card_limit', 'transaction_code', 'amount']

# Build one SQL expression: "account_id is not null and phone_number is not null and ..."
finaldf = finaldf.filter(' and '.join(c + ' is not null' for c in filter_cols))
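The repeated withColumn calls can be collapsed the same way by driving them from a mapping of column name to pattern. Below is a minimal sketch, assuming the column/pattern pairs from the question (the patterns dict simply restates them as data); it also shows an equivalent Column-based way to combine the null checks with functools.reduce:

from functools import reduce
from pyspark.sql.functions import regexp_replace

# Column -> regex of characters to strip (pairs taken from the question)
patterns = {
    'phone_number': '[^0-9]',
    'account_id': '[^0-9]',
    'credit_card_limit': '[^0-9]',
    'credit_card_number': '[^0-9]',
    'full_name': '[^a-zA-Z ]',
    'transaction_code': '[^a-zA-Z]',
    'shop': '[^a-zA-Z ]',
}

finaldf = df
for col_name, pattern in patterns.items():
    # One withColumn per entry, replacing the hand-written chain
    finaldf = finaldf.withColumn(col_name, regexp_replace(col_name, pattern, ''))

# Column-based equivalent of the SQL-string filter above
filter_cols = ['account_id', 'phone_number', 'credit_card_number',
               'credit_card_limit', 'transaction_code', 'amount']
finaldf = finaldf.filter(reduce(lambda a, b: a & b,
                                [finaldf[c].isNotNull() for c in filter_cols]))

Both filter forms express the same condition, and Spark's optimizer merges chained filters anyway, so the gain here is conciseness rather than runtime.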