以下是我的数据,我正在使用parcel_id进行分组,如果需要 imprv_det_type_cd以MA开头
输入:
+------------+----+-----+-----------------+
| parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272| MA|
|000000100010|2014| 800| 60P|
|000000100010|2014| 3200| MA2|
|000000100010|2014| 1620| 49R|
|000000100010|2014| 1446| 46R|
|000000100010|2014|40140| 45B|
|000000100010|2014| 1800| 45C|
|000000100010|2014| 864| 49C|
|000000100010|2014| 1| 48S|
+------------+----+-----+-----------------+
在这种情况下,从上方仅考虑两行。
预期输出:
+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010 |MA |7472 |2014 |
+---------+-----------------+--------------------+----------+
代码:
# read APPRAISAL_IMPROVEMENT_DETAIL.TXT
def _transfrom_imp_detail():
w_impr = Window.partitionBy("parcel_id")
return(
(spark.read.text(path_ade_imp_info)
.select(
F.trim(F.col("value").substr(1,12)).alias("parcel_id"),
F.trim(F.col("value").substr(86,4)).cast("integer").alias("year"),
F.trim(F.col("value").substr(94,15)).cast("integer").alias("sqft"),
F.trim(F.col("value").substr(41,10)).alias("imprv_det_type_cd"),
)
.withColumn(
"parcel_id",
F.regexp_replace('parcel_id', r'^[0]*', '')
)
.withColumn("structure_total_sqft", F.sum("sqft").over(w_impr))
.withColumn("year_built", F.min("year").over(w_impr))
).drop("sqft", "year").drop_duplicates(["parcel_id"])
)
我知道此代码中的.withColumn(“ structure_total_sqft”,F.sum(“ sqft”)。over(w_impr))有所更改,但不确定我必须做什么。我尝试了功能,但仍然无法正常工作。
先谢谢您
答案 0 :(得分:1)
我不知道您为什么要groupBy
,但您不想这么做。
df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
.filter("imprv_det_type_cd like 'MA%'") \
.groupBy('parcel_id', 'year') \
.agg(f.sum('sqft').alias('sqft'), f.first(f.substring('imprv_det_type_cd', 0, 2)).alias('imprv_det_type_cd')) \
.show(10, False)
+---------+----+------+-----------------+
|parcel_id|year|sqft |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010 |2014|7472.0|MA |
+---------+----+------+-----------------+
答案 1 :(得分:0)
使用 sum(when(..))
df2.show(false)
df2.printSchema()
/**
* +------------+----+-----+-----------------+
* |parcel_id |year|sqft |imprv_det_type_cd|
* +------------+----+-----+-----------------+
* |000000100010|2014|4272 |MA |
* |000000100010|2014|800 |60P |
* |000000100010|2014|3200 |MA2 |
* |000000100010|2014|1620 |49R |
* |000000100010|2014|1446 |46R |
* |000000100010|2014|40140|45B |
* |000000100010|2014|1800 |45C |
* |000000100010|2014|864 |49C |
* |000000100010|2014|1 |48S |
* +------------+----+-----+-----------------+
*
* root
* |-- parcel_id: string (nullable = true)
* |-- year: string (nullable = true)
* |-- sqft: string (nullable = true)
* |-- imprv_det_type_cd: string (nullable = true)
*/
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
.agg(
sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
first("imprv_det_type_cd").as("imprv_det_type_cd"),
first($"year").as("year_built")
)
p.show(false)
p.explain()
/**
* +---------+--------------------+-----------------+----------+
* |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
* +---------+--------------------+-----------------+----------+
* |100010 |7472.0 |MA |2014 |
* +---------+--------------------+-----------------+----------+
*/