Question

以下是我的数据，我正在使用parcel_id进行分组，如果需要 imprv_det_type_cd以MA开头

输入：

+------------+----+-----+-----------------+
|   parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272|               MA|
|000000100010|2014|  800|              60P|
|000000100010|2014| 3200|              MA2|
|000000100010|2014| 1620|              49R|
|000000100010|2014| 1446|              46R|
|000000100010|2014|40140|              45B|
|000000100010|2014| 1800|              45C|
|000000100010|2014|  864|              49C|
|000000100010|2014|    1|              48S|
+------------+----+-----+-----------------+

在这种情况下，从上方仅考虑两行。

预期输出：

+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010   |MA               |7472               |2014       |
+---------+-----------------+--------------------+----------+

代码：

# read APPRAISAL_IMPROVEMENT_DETAIL.TXT
def _transfrom_imp_detail():
    w_impr = Window.partitionBy("parcel_id")
    return(
    
        (spark.read.text(path_ade_imp_info)
            .select(
                F.trim(F.col("value").substr(1,12)).alias("parcel_id"),
                F.trim(F.col("value").substr(86,4)).cast("integer").alias("year"),
                F.trim(F.col("value").substr(94,15)).cast("integer").alias("sqft"),
                F.trim(F.col("value").substr(41,10)).alias("imprv_det_type_cd"),
            )
            .withColumn(
                    "parcel_id",
                    F.regexp_replace('parcel_id', r'^[0]*', '')
            )
            .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr))
            .withColumn("year_built", F.min("year").over(w_impr))
        ).drop("sqft", "year").drop_duplicates(["parcel_id"])
    )

我知道此代码中的.withColumn（“ structure_total_sqft”，F.sum（“ sqft”）。over（w_impr））有所更改，但不确定我必须做什么。我尝试了功能，但仍然无法正常工作。

先谢谢您

Answer 1

我不知道您为什么要groupBy，但您不想这么做。

df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
  .filter("imprv_det_type_cd like 'MA%'") \
  .groupBy('parcel_id', 'year') \
  .agg(f.sum('sqft').alias('sqft'), f.first(f.substring('imprv_det_type_cd', 0, 2)).alias('imprv_det_type_cd')) \
  .show(10, False)

+---------+----+------+-----------------+
|parcel_id|year|sqft  |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010   |2014|7472.0|MA               |
+---------+----+------+-----------------+

Answer 2

使用 sum(when(..))

 df2.show(false)
    df2.printSchema()
    /**
      * +------------+----+-----+-----------------+
      * |parcel_id   |year|sqft |imprv_det_type_cd|
      * +------------+----+-----+-----------------+
      * |000000100010|2014|4272 |MA               |
      * |000000100010|2014|800  |60P              |
      * |000000100010|2014|3200 |MA2              |
      * |000000100010|2014|1620 |49R              |
      * |000000100010|2014|1446 |46R              |
      * |000000100010|2014|40140|45B              |
      * |000000100010|2014|1800 |45C              |
      * |000000100010|2014|864  |49C              |
      * |000000100010|2014|1    |48S              |
      * +------------+----+-----+-----------------+
      *
      * root
      * |-- parcel_id: string (nullable = true)
      * |-- year: string (nullable = true)
      * |-- sqft: string (nullable = true)
      * |-- imprv_det_type_cd: string (nullable = true)
      */

    val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
      .agg(
        sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
        first("imprv_det_type_cd").as("imprv_det_type_cd"),
        first($"year").as("year_built")
      )
    p.show(false)
    p.explain()

    /**
      * +---------+--------------------+-----------------+----------+
      * |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
      * +---------+--------------------+-----------------+----------+
      * |100010   |7472.0              |MA               |2014      |
      * +---------+--------------------+-----------------+----------+
      */

有条件计算总和

2 个答案: