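I have data relating to boroughs, street names, and ZIP codes. I am trying to fill in the missing ZIP code values based on the borough and street name.
My data looks like this: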
        BOROUGH Street.Name      Zip.Code
2850662 BRONX   CITY ISLAND ROAD 10464
2850740 BRONX   CITY ISLAND ROAD 10464
2850749 BRONX   CITY ISLAND ROAD NA
2850919 BRONX   CITY ISLAND ROAD 10464
3491200 BRONX   CITY ISLAND ROAD NA
The expected output is:
        BOROUGH Street.Name      Zip.Code
2850662 BRONX   CITY ISLAND ROAD 10464
2850740 BRONX   CITY ISLAND ROAD 10464
2850749 BRONX   CITY ISLAND ROAD 10464
2850919 BRONX   CITY ISLAND ROAD 10464
3491200 BRONX   CITY ISLAND ROAD 10464
Answer 0 (score: 0)
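I think the approach is to build a lookup of the known ZIP codes, join it back onto the full data, and use the looked-up value wherever the ZIP code is missing.
Try this code: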
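from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()  # reuse the active session or start one

# The leading number in the question's data is a row ID; BOROUGH is the text column
schema = StructType([StructField('ID', IntegerType(), True),
                     StructField('BOROUGH', StringType(), True),
                     StructField('Street_Name', StringType(), True),
                     StructField('Zip_Code', IntegerType(), True)])
data = [(2850662, 'BRONX', 'CITY ISLAND ROAD', 10464),
        (2850740, 'BRONX', 'CITY ISLAND ROAD', 10464),
        (2850749, 'BRONX', 'CITY ISLAND ROAD', None),
        (2850919, 'BRONX', 'CITY ISLAND ROAD', 10464),
        (3491200, 'BRONX', 'CITY ISLAND ROAD', None)]
df = spark.createDataFrame(data, schema)

# Lookup of the known ZIP code for each (BOROUGH, Street_Name) pair
df_Zip_Code = (df.filter(df.Zip_Code.isNotNull())
                 .select('BOROUGH', 'Street_Name', 'Zip_Code')
                 .distinct())

# A left join keeps rows whose street has no known ZIP code; COALESCE fills the nulls.
# If one street maps to several ZIP codes, this join will duplicate rows.
(df.alias('a')
   .join(df_Zip_Code.alias('b'),
         (col('a.BOROUGH') == col('b.BOROUGH')) & (col('a.Street_Name') == col('b.Street_Name')),
         'left')
   .selectExpr("a.ID AS ID", "a.BOROUGH AS BOROUGH", "a.Street_Name AS Street_Name",
               "COALESCE(a.Zip_Code, b.Zip_Code) AS Zip_Code")
   .show())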
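If the data fits in memory, the same fill can probably be done without a join. Below is a minimal pandas sketch, not a definitive implementation. It assumes each borough and street pair maps to a single ZIP code, that ZIP codes are stored as strings, and that the columns are named with underscores. The helper name `fill_zip_by_group` is made up for this example.

```python
import pandas as pd


def fill_zip_by_group(df, keys=("BOROUGH", "Street_Name"), target="Zip_Code"):
    """Fill missing `target` values with the first known value in the same group.

    Hypothetical helper: assumes every (BOROUGH, Street_Name) group has at most
    one distinct ZIP code. Groups with no known value are left missing.
    """
    out = df.copy()
    # transform("first") broadcasts the first non-null value of each group
    # back onto every row of that group.
    known = out.groupby(list(keys))[target].transform("first")
    out[target] = out[target].fillna(known)
    return out


# Sample rows from the question; the index holds the original row IDs.
df = pd.DataFrame(
    {
        "BOROUGH": ["BRONX"] * 5,
        "Street_Name": ["CITY ISLAND ROAD"] * 5,
        "Zip_Code": ["10464", "10464", None, "10464", None],
    },
    index=[2850662, 2850740, 2850749, 2850919, 3491200],
)

filled = fill_zip_by_group(df)
print(filled)
```

If a street can span more than one ZIP code, taking the first known value is arbitrary. Using the most frequent value per group may be a better choice in that case.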