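I am new to PySpark.
I am reading data from a table and updating the same table. I have a requirement to search two columns for a short string and, if it is found, write a value to a new column.
The logic is as follows: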
IF
    (Terminal_Region is not NULL & Terminal_Region contains "WC") OR
    (Terminal_Footprint is not NULL & Terminal_Footprint contains "WC")
THEN REGION = "EOR"
ELSE
    REGION = "WOR"
If both fields are NULL, then REGION = 'NotMapped'.
I need to create the new REGION column in a DataFrame using PySpark. Can someone help me?
+-------------------+-------------------+----------+
|Terminal_Region    |Terminal_footprint | REGION   |
+-------------------+-------------------+----------+
| west street WC    |                   | EOR      |
| WC 87650          |                   | EOR      |
| BOULVEVARD WC     |                   | EOR      |
|                   |                   |Not Mapped|
|                   |landinf dr WC      | EOR      |
|                   |FOX VALLEY WC 76543| EOR      |
+-------------------+-------------------+----------+
Answer 0: (score: 0)
I think the following code should produce the desired output. The code should work with Spark 2.2 (which includes the contains function).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in the PySpark shell
# Create the sample DataFrame
df = spark.createDataFrame(
    [("west street WC", None),
     ("WC 87650", None),
     ("BOULVEVARD WC", None),
     (None, None),
     (None, "landinf dr WC"),
     (None, "FOX VALLEY WC 76543")],
    ["Terminal_Region", "Terminal_footprint"])
df.show()  # print the initial DataFrame

# "NotMapped" when both columns are null; otherwise "EOR" if either contains "WC", else "WOR"
df.withColumn(
    "REGION",
    when(col("Terminal_Region").isNull() & col("Terminal_footprint").isNull(), "NotMapped")
    .otherwise(when(col("Terminal_Region").contains("WC") | col("Terminal_footprint").contains("WC"), "EOR")
               .otherwise("WOR"))
).show()
Output:
#initial dataframe
+---------------+-------------------+
|Terminal_Region| Terminal_footprint|
+---------------+-------------------+
| west street WC| null|
| WC 87650| null|
| BOULVEVARD WC| null|
| null| null|
| null| landinf dr WC|
| null|FOX VALLEY WC 76543|
+---------------+-------------------+
# df with the logic applied
+---------------+-------------------+---------+
|Terminal_Region| Terminal_footprint| REGION|
+---------------+-------------------+---------+
| west street WC| null| EOR|
| WC 87650| null| EOR|
| BOULVEVARD WC| null| EOR|
| null| null|NotMapped|
| null| landinf dr WC| EOR|
| null|FOX VALLEY WC 76543| EOR|
+---------------+-------------------+---------+
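
The sample data above never reaches the WOR branch, so it may help to check the three-way rule separately. The following is only a minimal sketch in plain Python, not part of the Spark code: `classify_region` is a hypothetical helper name, `None` stands in for SQL NULL, and the "MAIN STREET" row is made up for illustration. It assumes the blank cells in the question are real nulls rather than empty strings; an empty string would fall through to WOR here.

```python
from typing import Optional


def classify_region(terminal_region: Optional[str],
                    terminal_footprint: Optional[str]) -> str:
    """Hypothetical helper mirroring the REGION rule; None stands in for NULL."""
    # Both fields missing: the row cannot be mapped
    if terminal_region is None and terminal_footprint is None:
        return "NotMapped"
    # Any non-null field containing "WC" maps the row to EOR
    for value in (terminal_region, terminal_footprint):
        if value is not None and "WC" in value:
            return "EOR"
    # At least one field is present, but neither contains "WC"
    return "WOR"


rows = [
    ("west street WC", None),
    (None, None),
    (None, "FOX VALLEY WC 76543"),
    ("MAIN STREET", None),  # made-up row, not in the question's data, to exercise WOR
]
for region, footprint in rows:
    print(region, footprint, classify_region(region, footprint))
```

For these inputs the when/otherwise expression should agree with the sketch: in Spark, a null operand in `contains(...) | contains(...)` yields either true or null, and a null condition in `when` is treated as not matched, so such a row falls through to "WOR".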