I have the following DataFrame:
from pyspark.sql.functions import col, when, length, lit, concat
values = [(1,'USA','12424','AB+'),(2,'Japan','63252','B-'),(3,'Ireland','23655',None),(4,'France','57366','O+'),
(5,'Ireland','82351','A-'),(6,'USA','35854','B+'),(7,'Ireland','5835','AB-'),(8,'USA','95255','B+')]
df = sqlContext.createDataFrame(values,['id','country','postcode','bloodgroup'])
df.show()
+---+-------+--------+----------+
| id|country|postcode|bloodgroup|
+---+-------+--------+----------+
|  1|    USA|   12424|       AB+|
|  2|  Japan|   63252|        B-|
|  3|Ireland|   23655|      null|
|  4| France|   57366|        O+|
|  5|Ireland|   82351|        A-|
|  6|    USA|   35854|        B+|
|  7|Ireland|    5835|       AB-|
|  8|    USA|   95255|        B+|
+---+-------+--------+----------+
I need to change the postcode and bloodgroup columns according to the following conditions, summarized in this rough Python pseudocode:
# Efficient (pseudocode 1)
if country == 'Ireland':
    if length(postcode) == 4:
        postcode = '0' + postcode  # Prepend 0 to the postcode in case it has only 4 digits.
    if bloodgroup == null:
        bloodgroup = 'Unknown'
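As you can see in the pseudocode above, the check country == 'Ireland' is performed only once, because it is a clause common to both conditions. Doing it the other way, by coupling this clause with each of the other two conditions using and, would be inefficient: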
# Inefficient (pseudocode 2)
if country == 'Ireland' and length(postcode) == 4:
    postcode = '0' + postcode
if country == 'Ireland' and bloodgroup == null:
    bloodgroup = 'Unknown'
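To make the intended behaviour concrete, here is a minimal plain-Python sketch of pseudocode 1 applied to the sample rows. It is only an illustration of the logic, not PySpark code; the names rows, fix_row and fixed are hypothetical and introduced just for this example.

```python
# Sample rows from the question: (id, country, postcode, bloodgroup)
rows = [
    (1, 'USA', '12424', 'AB+'), (2, 'Japan', '63252', 'B-'),
    (3, 'Ireland', '23655', None), (4, 'France', '57366', 'O+'),
    (5, 'Ireland', '82351', 'A-'), (6, 'USA', '35854', 'B+'),
    (7, 'Ireland', '5835', 'AB-'), (8, 'USA', '95255', 'B+'),
]


def fix_row(row):
    """Hypothetical helper: test the country once, then the two inner conditions."""
    id_, country, postcode, bloodgroup = row
    if country == 'Ireland':          # shared clause, evaluated once per row
        if len(postcode) == 4:
            postcode = '0' + postcode  # prepend 0 to a 4-digit postcode
        if bloodgroup is None:
            bloodgroup = 'Unknown'     # fill in a missing blood group
    return (id_, country, postcode, bloodgroup)


fixed = [fix_row(r) for r in rows]
for r in fixed:
    print(r)
```

Only the Irish rows can change here: row 3 gets the bloodgroup 'Unknown' and row 7 gets the postcode '05835'. Every other row passes through untouched.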
I am using PySpark, and the only way I know to do this is as follows:
df = df.withColumn('postcode',when((col('country') == 'Ireland') & (length(col('postcode')) == 4),concat(lit('0'),col('postcode'))).otherwise(col('postcode')))
df = df.withColumn('bloodgroup',when((col('country') == 'Ireland') & col('bloodgroup').isNull(),'Unknown').otherwise(col('bloodgroup')))
df.show()
+---+-------+--------+----------+
| id|country|postcode|bloodgroup|
+---+-------+--------+----------+
|  1|    USA|   12424|       AB+|
|  2|  Japan|   63252|        B-|
|  3|Ireland|   23655|   Unknown|
|  4| France|   57366|        O+|
|  5|Ireland|   82351|        A-|
|  6|    USA|   35854|        B+|
|  7|Ireland|   05835|       AB-|
|  8|    USA|   95255|        B+|
+---+-------+--------+----------+
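However, this corresponds to the inefficient pseudocode I wrote above, because we check country == 'Ireland' twice. I have checked the execution plan using df.explain(), and it does not perform any automatic optimization, which I thought Catalyst might do.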
How can we write PySpark code corresponding to pseudocode 1, where we check the country only once and then test the two conditions?