Create new columns based on data in existing columns

Date: 2020-03-25 08:09:14

Tags: python-3.x pandas apache-spark-sql pandas-groupby

I have a three-level structure: property -> prov -> co. Each property has a segment, i.e. hotel/home. I wrote a query to get the count for each of them:

properties = spark.sql("""
    SELECT
        COUNT(ps.property_id) as property_count,
        ps.prov_id,
        c.id as co_id,
        ps.segment
    FROM
        schema.t1 ps
    INNER JOIN
        schema.t2 c
        ON c.id = ps.co_id
    GROUP BY
        2,3,4
""")
properties = properties.toPandas()

This gives me the total number of properties per segment, per prov, per co. Using the properties df above, I want to create a new df as follows:

- prov_id,
- prov_segment,
- co_id,
- co_segment

prov_segment should be 'Home' if more than 50% of the properties in that prov_id belong to the Home segment, and 'Core' otherwise. Similarly, co_segment should be 'Home' if more than 50% of the properties in that co_id belong to the Home segment, and 'Core' otherwise.
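As a sketch, the 50% rule for a single group can be written as a small helper (`pick_segment` is a hypothetical name, not part of the question):

```python
def pick_segment(home_count, core_count, threshold=0.5):
    """Label a group 'Home' if Home properties exceed the threshold share, else 'Core'."""
    total = home_count + core_count
    return 'Home' if home_count / total > threshold else 'Core'

# prov_id 1 in the sample data below: 200 Home vs 10 Core properties
pick_segment(200, 10)  # 'Home'
```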

I know I can get the total property count by grouping the data:

prop_total_count = properties.groupby('prov_id')['property_count'].sum()

However, I'm not sure how to use that to create the new DataFrame.

Sample data:

properties.show(6)

| property_count | prov_id | co_id | segment |
|----------------|---------|-------|---------|
| 10             | 1       | ABC   | Core    |
| 200            | 1       | ABC   | Home    |
| 300            | 9       | ABC   | Core    |
| 10             | 9       | ABC   | Home    |
| 100            | 131     | MNM   | Home    |
| 200            | 199     | KJK   | Home    |

Based on the above, I need the following output:

| prov_id | prov_segment | co_id | co_segment |
|---------|--------------|-------|------------|
| 1       | Home         | ABC   | Core       |
| 9       | Core         | ABC   | Core       |
| 131     | Home         | MNM   | Home       |
| 199     | Home         | KJK   | Home       |

prov_id 1 gets the Home segment because it has 200 Home properties against 10 Core properties. prov_id 9 gets a Core segment because it has 300 Core properties against 10 Home properties.

co_id ABC gets the Core segment because that portfolio has 310 Core properties in total against 210 Home properties.

prov_id 131 and 199 only appear in a single segment, so that segment carries over.
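The totals quoted above can be double-checked with a quick groupby over the sample data:

```python
import pandas as pd

# The sample data from the table above
properties = pd.DataFrame({
    'property_count': [10, 200, 300, 10, 100, 200],
    'prov_id': [1, 1, 9, 9, 131, 199],
    'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'],
    'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home'],
})

# Per-co totals per segment: ABC has 310 Core vs 210 Home
co_totals = properties.groupby(['co_id', 'segment'])['property_count'].sum()
```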

1 answer:

Answer 0: (score: 1)

Well, maybe this can be solved in a shorter way, but this should work. It relies on creating two additional DataFrames holding the segment per group (co_id and prov_id), and then merging the DataFrames at the end.

Older versions of pandas cannot merge a Series such as co_id['co_segment'] into a DataFrame, so I added a .to_frame() call for compatibility. For pandas versions >= 0.25.1 the operation is allowed and the call is redundant.
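To illustrate that compatibility point with toy data (the names here are made up for the sketch):

```python
import pandas as pd

left = pd.DataFrame({'prov_id': [1, 199], 'co_id': ['ABC', 'KJK']})

# A named Series indexed by co_id, like the ones produced by the grouped DataFrames below
seg = pd.Series(['Core', 'Home'],
                index=pd.Index(['ABC', 'KJK'], name='co_id'),
                name='co_segment')

# Portable: convert the Series to a one-column DataFrame before merging.
merged = pd.merge(left, seg.to_frame(), on='co_id')
# On pandas >= 0.25.1, pd.merge(left, seg, on='co_id') also works without to_frame().
```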

NB: this code assumes the only segments that exist are Home, Core and Managed.

import pandas as pd

properties = pd.DataFrame(data={'property_count': [10, 200, 300, 10, 100, 200], 
                                'prov_id': [1, 1, 9, 9, 131, 199], 
                                'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'], 
                                'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home']})


def get_segment(row):
    # A group gets a segment label once that segment holds more than 50% of its properties.
    if row['home_perc'] > 0.5:
        return 'Home'
    elif row['core_perc'] > 0.5:
        return 'Core'
    else:
        return 'Managed'


def get_grouped_dataframe(properties_df, grouping_col):
    grouped = pd.DataFrame()
    grouped['total'] = properties_df.groupby(grouping_col)['property_count'].sum()
    grouped['home'] = properties_df[properties_df.segment == 'Home'].groupby(grouping_col)['property_count'].sum()
    grouped['core'] = properties_df[properties_df.segment == 'Core'].groupby(grouping_col)['property_count'].sum()
    grouped['managed'] = properties_df[properties_df.segment == 'Managed'].groupby(grouping_col)['property_count'].sum()
    grouped['home_perc'] = (grouped['home'] / grouped['total']).fillna(0)
    grouped['core_perc'] = (grouped['core'] / grouped['total']).fillna(0)
    grouped['managed_perc'] = (grouped['managed'] / grouped['total']).fillna(0)
    grouped['segment'] = grouped.apply(get_segment, axis=1)

    return grouped


prov_id = get_grouped_dataframe(properties, 'prov_id')
prov_id.rename(columns={'segment': 'prov_segment'}, inplace=True)

#          total  home   core  home_perc  core_perc prov_segment
# prov_id                                                  
# 1          210   200   10.0   0.952381   0.047619         Home
# 9          310    10  300.0   0.032258   0.967742         Core
# 131        100   100    NaN   1.000000   0.000000         Home
# 199        200   200    NaN   1.000000   0.000000         Home

co_id = get_grouped_dataframe(properties, 'co_id')
co_id.rename(columns={'segment': 'co_segment'}, inplace=True)

#        total  home   core  home_perc  core_perc co_segment
# co_id                                                  
# ABC      520   210  310.0   0.403846   0.596154       Core
# KJK      200   200    NaN   1.000000   0.000000       Home
# MNM      100   100    NaN   1.000000   0.000000       Home

property_segments = properties.drop(columns=['property_count', 'segment']).drop_duplicates()

property_segments = pd.merge(property_segments, prov_id['prov_segment'].to_frame(), on='prov_id')
property_segments = pd.merge(property_segments, co_id['co_segment'].to_frame(), on='co_id')

#    prov_id co_id co_segment prov_segment
# 0        1   ABC       Core         Home
# 1        9   ABC       Core         Core
# 2      131   MNM       Home         Home
# 3      199   KJK       Home         Home
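For what it's worth, the "shorter way" hinted at above could look like the following pivot_table sketch. It applies the same 50% rule but is not the code from this answer; `majority_segment` is a made-up helper name:

```python
import pandas as pd

properties = pd.DataFrame({
    'property_count': [10, 200, 300, 10, 100, 200],
    'prov_id': [1, 1, 9, 9, 131, 199],
    'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'],
    'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home'],
})

def majority_segment(shares_row, threshold=0.5):
    # Pick the segment holding more than half of the properties, else fall back to 'Managed'.
    top = shares_row.idxmax()
    return top if shares_row[top] > threshold else 'Managed'

# One row per prov_id, one column per segment, values = property counts
shares = properties.pivot_table(index='prov_id', columns='segment',
                                values='property_count', aggfunc='sum', fill_value=0)
shares = shares.div(shares.sum(axis=1), axis=0)   # counts -> shares per row
prov_segment = shares.apply(majority_segment, axis=1)
# prov_id 1 -> Home, 9 -> Core, 131 -> Home, 199 -> Home
```

The same two lines with `index='co_id'` produce the co_segment column, so the helper replaces the per-segment groupbys above.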

EDIT: moved the duplicated code into a function and added the Managed segment based on the comments. Added the extra to_frame() calls for compatibility.