我有一个名为monthly_agg的表,其中包含每月的汇总数据。
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| yyyy_mm_dd | id | app | ex_status | active_status | active_count | active_base | ex_count | ex_base |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 123 | content | impl | impl | 390 | 321 | 344 | 340 |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 333 | messages | impl | impl | 541 | 210 | 788 | 610 |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 832 | photos | no | no | null | 430 | null | 100 |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
我想让每个应用程序都成为一列。每个应用程序列应包含一个百分比,其计算公式如下:
df=spark.sql("""
SELECT
yyyy_mm_dd,
id,
app,
SUM(CASE
WHEN (app = ‘content’ AND ex_status = ‘impl’) THEN ex_count/ex_base
WHEN (active_status = 'impl') THEN active_count/active_base
END) AS percentage
FROM
monthly_agg
""")
我需要使每个app
值成为一列,然后将该列的值作为上述计算的结果。如何使用Pandas而不是HQL来以这种方式旋转表?理想情况下,我的输出df如下所示:
+------------+-----+--------------------+---------------------+
| yyyy_mm_dd | id | content_percentage | messages_percentage |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 123 | 1.2 | null |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 333 | null | 2.57 |
+------------+-----+--------------------+---------------------+
我大约有20个应用程序,所以动态性会很棒。
答案 0 :(得分:1)
使用numpy.select
作为百分比,然后使用DataFrame.pivot_table
:
m1 = (df.app == 'content') & (df.ex_status == 'impl')
m2 = df.active_status == 'impl'
s1 = df.ex_count / df.ex_base
s2 = df.ex_count / df.active_base
df['percentage'] = np.select([m1, m2], [s1,s2], np.nan)
df1 = (df.pivot_table(index=['yyyy_mm_dd','id'],
columns='app',
values='percentage',
aggfunc=lambda x: x.sum(min_count=1))
.add_suffix('_percentage')
.reset_index())
print (df1)
app yyyy_mm_dd id content_percentage messages_percentage
0 20190131 123 1.011765 NaN
1 20190131 333 NaN 3.752381
编辑:
print (m1)
0 True
1 False
2 False
dtype: bool
print (m2)
0 True
1 True
2 False
Name: active_status, dtype: bool
print (s1)
0 1.011765
1 1.291803
2 NaN
dtype: float64
print (s2)
0 1.071651
1 3.752381
2 NaN
dtype: float64