I store all of the string column names in a list object. Currently, I pass each of those columns through a for loop to compute the aggregate counts.
I am looking for a way to get the aggregated counts for all string columns at once. Please help.
Sample data:
The DataFrame (Input_Data) has these records:
NoOfSegments,SegmentID,Country
3,2,Bangalore
3,2,Bangalore
3,3,Delhi
3,2,Delhi
3,3,Delhi
3,1,Pune
3,3,Bangalore
3,1,Pune
3,1,Delhi
3,3,Bangalore
3,1,Delhi
3,3,Bangalore
3,3,Pune
3,2,Delhi
3,3,Pune
3,2,Pune
3,2,Pune
3,3,Pune
3,1,Bangalore
3,1,Bangalore
My code:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

input_data.createOrReplaceTempView('input_data')

sub = "string"
category_columns = [name for name, data_type in input_data.dtypes
                    if sub in data_type]

df_final_schema = StructType([StructField("Country", StringType()),
                              StructField("SegmentID", IntegerType()),
                              StructField("total_cnt", IntegerType())])
df_final = spark.createDataFrame([], df_final_schema)

for cat_col in category_columns:
    query = ("SELECT {d_name} AS Country, SegmentID, "
             "(count(*) OVER (PARTITION BY {d_name}, SegmentID) / "
             "count(*) OVER (PARTITION BY NoOfSegments)) * 100 AS total_cnt "
             "FROM input_data ORDER BY {d_name}, SegmentID").format(d_name=cat_col)
    new_df = spark.sql(query)
    df_final = df_final.union(new_df)
Result:
Is there any way to pass all the string columns at once and compute the above result in a single DataFrame?
Answer (score: 1)
You can try the following using groupBy (or its groupby alias):
from pyspark.sql import functions as F

total = df.select(F.sum("NoOfSegments")).take(1)[0][0]

df \
    .groupBy("SegmentID", "Country") \
    .agg(F.sum('NoOfSegments').alias('sums')) \
    .withColumn('total_cnt', 100 * F.col('sums') / F.lit(total)) \
    .select('Country', 'SegmentID', 'total_cnt') \
    .sort('Country', 'SegmentID').show()
# +---------+---------+---------+
# | Country|SegmentID|total_cnt|
# +---------+---------+---------+
# |Bangalore| 1| 10.0|
# |Bangalore| 2| 10.0|
# |Bangalore| 3| 15.0|
# | Delhi| 1| 10.0|
# | Delhi| 2| 10.0|
# | Delhi| 3| 10.0|
# | Pune| 1| 10.0|
# | Pune| 2| 10.0|
# | Pune| 3| 15.0|
# +---------+---------+---------+
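If you want the same percentage for every string column (as the question asks) rather than just Country, one option is to apply the same groupBy/agg pattern to each string column and union the per-column results. The following is only a minimal sketch, not part of the answer above; the column_name and category output columns are hypothetical names introduced here for illustration:

from functools import reduce
from pyspark.sql import functions as F

# Grand total used as the percentage denominator, same as in the answer above.
total = df.select(F.sum("NoOfSegments")).take(1)[0][0]

# All string-typed columns of the DataFrame (only Country in the sample data).
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]

def percent_by(col_name):
    # Same groupBy/agg pattern as the answer, but the grouping column is renamed
    # to a common "category" column so results for different columns can be unioned.
    return (df
            .groupBy("SegmentID", col_name)
            .agg(F.sum("NoOfSegments").alias("sums"))
            .withColumn("total_cnt", 100 * F.col("sums") / F.lit(total))
            .select(F.lit(col_name).alias("column_name"),
                    F.col(col_name).alias("category"),
                    "SegmentID",
                    "total_cnt"))

df_all = reduce(lambda a, b: a.unionByName(b),
                [percent_by(c) for c in string_cols])
df_all.sort("column_name", "category", "SegmentID").show()

With only Country as a string column in the sample data, this produces the same nine rows as the output above, plus the extra column_name column; additional string columns would simply add more rows.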