I have a three-level structure: property -> prov -> co. Each property also has a segment, either Hotel or Home. I wrote a query to get the counts for each:
properties = spark.sql("""
SELECT
    COUNT(ps.property_id) AS property_count,
    ps.prov_id,
    c.id AS co_id,
    ps.segment
FROM
    schema.t1 ps
INNER JOIN
    schema.t2 c
ON c.id = ps.co_id
GROUP BY
    2, 3, 4
""")
properties = properties.toPandas()
This gives me the total number of properties per segment, per province, per company. Using the properties df above, I want to create a new df with the following columns:
- prov_id,
- prov_segment,
- co_id,
- co_segment
prov_segment should be "Home" if more than 50% of the properties under that prov_id are in the Home segment, otherwise it should be "Core".

Similarly, co_segment should be "Home" if more than 50% of that co_id's properties are in Home prov_segments, otherwise it should be "Core".
I know I can get the total property count by grouping the data:
prop_total_count = properties.groupby('prov_id')['property_count'].sum()
However, I'm not sure how to use that to create the new dataframe.
Sample data:
properties.show(6):
| property_count | prov_id | co_id | segment |
|----------------|---------|-------|---------|
| 10 | 1 | ABC | Core |
| 200 | 1 | ABC | Home |
| 300 | 9 | ABC | Core |
| 10 | 9 | ABC | Home |
| 100 | 131 | MNM | Home |
| 200 | 199 | KJK | Home |
Based on the above, I need the following output:
| prov_id | prov_segment | co_id | co_segment |
|---------|--------------|-------|------------|
| 1 | Home | ABC | Core |
| 9 | Core | ABC | Core |
| 131 | Home | MNM | Home |
| 199 | Home | KJK | Home |
prov_id 1 gets the Home segment because it has 200 Home properties to 10 Core properties. prov_id 9 gets a Core segment because it has 300 Core properties to 10 Home properties.
co_id ABC gets the Core segment because the portfolio has 310 Core properties in total versus 210 Home properties.
prov_ids 131 and 199 only appear in a single segment, so that segment carries over.
Answer (score: 1)
Well, this can perhaps be solved in a "shorter" way, but the following should work. It relies on creating two additional DataFrames that hold the segment per group (co_id or prov_id), and then merging the DataFrames at the end.

With older pandas versions it is not possible to merge a Series such as co_id['co_segment'] into a DataFrame, so I added the .to_frame() call for compatibility. For pandas versions >= 0.25.1 the operation is allowed and the call is redundant.

NB: this code assumes the only segments are Home, Core and Managed.
import pandas as pd

properties = pd.DataFrame(data={'property_count': [10, 200, 300, 10, 100, 200],
                                'prov_id': [1, 1, 9, 9, 131, 199],
                                'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'],
                                'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home']})

def get_segment(row):
    if row['home_perc'] > 0.5:
        return 'Home'
    elif row['core_perc'] > 0.5:
        return 'Core'
    else:
        return 'Managed'

def get_grouped_dataframe(properties_df, grouping_col):
    # Per-group totals and per-segment counts; groups missing a segment get NaN.
    grouped = pd.DataFrame()
    grouped['total'] = properties_df.groupby(grouping_col)['property_count'].sum()
    grouped['home'] = properties_df[properties_df.segment == 'Home'].groupby(grouping_col)['property_count'].sum()
    grouped['core'] = properties_df[properties_df.segment == 'Core'].groupby(grouping_col)['property_count'].sum()
    grouped['managed'] = properties_df[properties_df.segment == 'Managed'].groupby(grouping_col)['property_count'].sum()
    grouped['home_perc'] = (grouped['home'] / grouped['total']).fillna(0)
    grouped['core_perc'] = (grouped['core'] / grouped['total']).fillna(0)
    grouped['managed_perc'] = (grouped['managed'] / grouped['total']).fillna(0)
    grouped['segment'] = grouped.apply(get_segment, axis=1)
    return grouped
prov_id = get_grouped_dataframe(properties, 'prov_id')
prov_id.rename(columns={'segment': 'prov_segment'}, inplace=True)
# total home core home_perc core_perc prov_segment (managed columns omitted)
# prov_id
# 1 210 200 10.0 0.952381 0.047619 Home
# 9 310 10 300.0 0.032258 0.967742 Core
# 131 100 100 NaN 1.000000 0.000000 Home
# 199 200 200 NaN 1.000000 0.000000 Home
co_id = get_grouped_dataframe(properties, 'co_id')
co_id.rename(columns={'segment': 'co_segment'}, inplace=True)
# total home core home_perc core_perc co_segment (managed columns omitted)
# co_id
# ABC 520 210 310.0 0.403846 0.596154 Core
# KJK 200 200 NaN 1.000000 0.000000 Home
# MNM 100 100 NaN 1.000000 0.000000 Home
property_segments = properties.drop(columns=['property_count', 'segment']).drop_duplicates()
property_segments = pd.merge(property_segments, prov_id['prov_segment'].to_frame(), on='prov_id')
property_segments = pd.merge(property_segments, co_id['co_segment'].to_frame(), on='co_id')
# prov_id co_id co_segment prov_segment
# 0 1 ABC Core Home
# 1 9 ABC Core Core
# 2 131 MNM Home Home
# 3 199 KJK Home Home
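On the "shorter way" remark above: as a hedged sketch (not the method used in this answer), the per-group shares can also be computed in one shot with pd.crosstab and normalize='index', and the same >50% rule applied per row. The helper name majority_segment is my own invention for illustration:

```python
import pandas as pd

properties = pd.DataFrame({'property_count': [10, 200, 300, 10, 100, 200],
                           'prov_id': [1, 1, 9, 9, 131, 199],
                           'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'],
                           'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home']})

def majority_segment(df, group_col, out_name):
    # Share of property_count per segment within each group; rows sum to 1.
    perc = pd.crosstab(df[group_col], df['segment'],
                       values=df['property_count'], aggfunc='sum',
                       normalize='index').fillna(0)
    # Same rule as above: a segment wins with > 50%, otherwise fall back to Managed.
    seg = perc.apply(lambda row: row.idxmax() if row.max() > 0.5 else 'Managed', axis=1)
    return seg.rename(out_name)

out = properties[['prov_id', 'co_id']].drop_duplicates()
out = out.merge(majority_segment(properties, 'prov_id', 'prov_segment').to_frame(), on='prov_id')
out = out.merge(majority_segment(properties, 'co_id', 'co_segment').to_frame(), on='co_id')
print(out)
```

This reproduces the same four rows as the merge-based approach, at the cost of being a little more cryptic.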
Edit: moved the duplicated code into a function and added the Managed segment based on the comments. Added the extra to_frame() call for compatibility.
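To make the .to_frame() compatibility point concrete, here is a minimal sketch (the tiny frames and values are made up purely for illustration): merging a named Series into a DataFrame only works directly on newer pandas, while the .to_frame() form works on both old and new versions.

```python
import pandas as pd

# Hypothetical minimal input, just to illustrate the merge.
left = pd.DataFrame({'prov_id': [1, 9], 'co_id': ['ABC', 'ABC']})
seg = pd.Series(['Home', 'Core'],
                index=pd.Index([1, 9], name='prov_id'),
                name='prov_segment')

# On older pandas, pd.merge(left, seg, ...) with the bare Series raises;
# converting with .to_frame() first is accepted everywhere.
merged = pd.merge(left, seg.to_frame(), on='prov_id')
print(merged)
```

Note that the merge key comes from a column on the left and from the (named) index on the right, which pandas resolves automatically.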