Filling in missing values after grouping by column

Time: 2019-03-30 19:44:41

Tags: pyspark pyspark-sql

I have data on boroughs, street names, and zip codes. I am trying to fill in the missing zip codes based on the borough and street name.

My data looks like this:

            BOROUGH      Street.Name Zip.Code
    2850662   BRONX CITY ISLAND ROAD    10464
    2850740   BRONX CITY ISLAND ROAD    10464
    2850749   BRONX CITY ISLAND ROAD       NA
    2850919   BRONX CITY ISLAND ROAD    10464
    3491200   BRONX CITY ISLAND ROAD       NA

The expected output is:

            BOROUGH      Street.Name Zip.Code
    2850662   BRONX CITY ISLAND ROAD    10464
    2850740   BRONX CITY ISLAND ROAD    10464
    2850749   BRONX CITY ISLAND ROAD    10464
    2850919   BRONX CITY ISLAND ROAD    10464
    3491200   BRONX CITY ISLAND ROAD    10464

1 answer:

Answer 0: (score: 0)

I think we need to follow this approach:

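  1. Build a mapping from Street_Name to Zip_Code, filtering out the rows where the zip code is null.
  2. Join the main dataframe with the Zip_Code dataframe on Street_Name. Keep Zip_Code where it is already non-null in the main dataframe; otherwise fill it from the Zip_Code dataframe.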
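
The two steps can be checked on the sample rows in plain Python before moving to Spark. This is only a minimal sketch of the logic, not Spark code; the names `rows`, `zip_by_street`, and `filled` are made up for illustration, and it assumes each street has at most one known zip code.

```python
# Sample rows as used in this answer: (id, street name, zip code or None).
rows = [
    (2850662, "BRONX CITY ISLAND ROAD", 10464),
    (2850740, "BRONX CITY ISLAND ROAD", 10464),
    (2850749, "BRONX CITY ISLAND ROAD", None),
    (2850919, "BRONX CITY ISLAND ROAD", 10464),
    (3491200, "BRONX CITY ISLAND ROAD", None),
]

# Step 1: street -> zip lookup, built only from rows whose zip code is known.
zip_by_street = {
    street: zip_code for _, street, zip_code in rows if zip_code is not None
}

# Step 2: keep an existing zip code, otherwise fall back to the lookup.
# Streets with no known zip code at all stay None.
filled = [
    (row_id, street, zip_code if zip_code is not None else zip_by_street.get(street))
    for row_id, street, zip_code in rows
]

for row in filled:
    print(row)
```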

Try this code:

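from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col

# Reuse the active session (the pyspark shell already provides `spark`).
spark = SparkSession.builder.getOrCreate()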

schema = StructType([StructField('BOROUGH', IntegerType(), True),
                     StructField('Street_Name', StringType(), True),
                     StructField('Zip_Code', IntegerType(), True)])


data = [(2850662,'BRONX CITY ISLAND ROAD',10464),
        (2850740,'BRONX CITY ISLAND ROAD',10464),
        (2850749,'BRONX CITY ISLAND ROAD',None),
        (2850919,'BRONX CITY ISLAND ROAD',10464),
        (3491200,'BRONX CITY ISLAND ROAD',None)]

df = spark.createDataFrame(data,schema)

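# Step 1: one row per (Street_Name, Zip_Code) pair, taken from rows with a known zip code.
df_Zip_Code = df.filter(df.Zip_Code.isNotNull()).select('Street_Name','Zip_Code').distinct()

# Step 2: a left join keeps streets that have no known zip code (their zip stays null)
# instead of dropping them, as an inner join would. COALESCE keeps a.Zip_Code when present.
# Note: a street with more than one distinct zip code would yield duplicate rows here.
df.alias('a').\
    join(df_Zip_Code.alias('b'), col('a.Street_Name') == col('b.Street_Name'), 'left').\
    selectExpr("a.BOROUGH AS BOROUGH",
               "a.Street_Name AS Street_Name",
               "COALESCE(a.Zip_Code, b.Zip_Code) AS Zip_Code").\
    show()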