Pandas: is it possible to group values in a column together?

Asked: 2018-10-05 22:15:51

Tags: python pandas

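I'm working with a DataFrame in pandas where one of the columns is called "Rank" and holds military ranks. When I use groupby on the data and build a crosstab, I notice that some of the values in Rank are synonyms. For example, my crosstab has separate rows for "Private 1st Class", "Private First Class", and "PFC".

Assuming I can manually create a dictionary that ties all of these "synonyms" together, is it possible to have pandas apply it to my DataFrame so that all of those values are treated as the same for counting, crosstabs, and so on? For the example above, if I decided to standardize on "PFC", I would create the following: {"Private 1st Class": "PFC", "Private First Class": "PFC"}

I've looked at groupby, but as far as I can tell it groups the whole frame by the values of a column and doesn't support treating similar values as one. If I'm wrong, could someone point me to the relevant part of the documentation?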

2 answers:

Answer 0: (score: 2)

Here is an example for you:

Data:

import pandas as pd

df = pd.DataFrame({"val": [1,2,3,4,5], "key": ["Private 1st class", "Private First Class", "PFC", "other", "other"]})

Translation dictionary:

translate = pd.DataFrame.from_records({"key": ["Private 1st class", "PFC", "Private First Class"],
                           "harmonizedkey": ["PFC", "PFC", "PFC"]})

Let's merge the dictionary into df:

newdf = pd.merge(df, translate, how = "left", on = "key")

Create a new (complete) group column, falling back to the original key wherever there is no translation:

newdf["newgroup"] = newdf["harmonizedkey"].combine_first(newdf["key"])
newdf

    key                 val harmonizedkey   newgroup
0   Private 1st class   1   PFC             PFC
1   Private First Class 2   PFC             PFC
2   PFC                 3   PFC             PFC
3   other               4   NaN             other
4   other               5   NaN             other
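
In case combine_first is unfamiliar, here is a minimal sketch of what it does in the step above: it keeps the caller's value where one exists and takes the other Series' value where the caller is missing. The two Series below are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-ins for the "harmonizedkey" and "key" columns
harmonized = pd.Series(["PFC", None, "PFC", None])
original = pd.Series(["Private 1st class", "other", "PFC", "SGT"])

# Keep harmonized where it is not null, otherwise fall back to original
filled = harmonized.combine_first(original)
print(filled.tolist())
```

This is why rows without a translation ("other" above) keep their own label instead of ending up as NaN.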

Now use groupby, selecting the numeric column so that the string columns are not summed as well:

newdf.groupby("newgroup")[["val"]].sum()

        val
newgroup    
PFC     6
other   9
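
If the mapping already lives in a plain dict, as in the question, the merge and combine_first steps can probably be collapsed into a single Series.replace call, which leaves values that are not in the dict untouched. This is a sketch of that alternative rather than part of the answer above; the name synonyms is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 4, 5],
                   "key": ["Private 1st class", "Private First Class",
                           "PFC", "other", "other"]})

# Hypothetical synonym dictionary; keys absent from it are left as-is by replace
synonyms = {"Private 1st class": "PFC", "Private First Class": "PFC"}

# Build the harmonized grouping column in one step
df["newgroup"] = df["key"].replace(synonyms)

# Sum the numeric column per harmonized group
totals = df.groupby("newgroup")["val"].sum()
print(totals)
```

The merge-based version may still be preferable when the translation table is maintained as its own DataFrame or file.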

Answer 1: (score: 0)

Use map and a dictionary to generate a new column:

import pandas as pd
df = pd.DataFrame([
    ('Private 1st Class', 3),
    ('Private First Class', 2),
    ('PFC', 5),
    ('Sergeant', 2),
    ('SGT', 2)
], columns = ['rank', 'bannanas'])

d = {
    'Private 1st Class': 'PFC',
    'Private First Class': 'PFC',
    'PFC': 'PFC',
    'Sergeant': 'SGT',
    'SGT': 'SGT'
}

df['merged_rank'] = df['rank'].map(d)
print(df)
                  rank  bannanas merged_rank
0    Private 1st Class         3         PFC
1  Private First Class         2         PFC
2                  PFC         5         PFC
3             Sergeant         2         SGT
4                  SGT         2         SGT

print(df.groupby('merged_rank')['bannanas'].agg('sum'))   

merged_rank
PFC    10
SGT     4
Name: bannanas, dtype: int64
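
One thing to watch with this approach: map returns NaN for any value that is not a key in the dictionary, which is why d above lists every rank, including 'PFC' and 'SGT' themselves. A minimal sketch of a fallback, assuming you would rather list only the synonyms; the 'Corporal' row is a made-up example of an unmapped rank.

```python
import pandas as pd

df = pd.DataFrame([
    ('Private 1st Class', 3),
    ('Private First Class', 2),
    ('PFC', 5),
    ('Corporal', 1),  # hypothetical rank that is missing from the dictionary
], columns=['rank', 'bannanas'])

# Only the synonyms are listed here, not the canonical names
d = {'Private 1st Class': 'PFC', 'Private First Class': 'PFC'}

# map gives NaN for unmapped ranks; fillna restores the original value
df['merged_rank'] = df['rank'].map(d).fillna(df['rank'])

# Count rows per harmonized rank
counts = df['merged_rank'].value_counts()
print(counts)
```

The merged_rank column can then be passed to groupby or pd.crosstab in place of the raw rank column.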