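I'm using pandas to work with a DataFrame in which one column, "Rank", holds military ranks. When I use groupby on the data and build a crosstab, I notice that some of the values in Rank are synonyms. For example, my crosstab has separate rows for "Private 1st Class", "Private First Class", and "PFC".
Suppose I manually create a dictionary that ties all of these "synonyms" together. Can I get pandas to apply it to my DataFrame so that the values are treated as identical for counting, crosstabs, and so on? For the example above, if I decided to standardize on "PFC", I would create the following: {"Private 1st Class": "PFC", "Private First Class": "PFC"}.
I've looked at groupby, but as far as I can tell it sorts the whole frame by a column and does not support treating similar values as one. If I'm wrong, can someone point me to the relevant part of the documentation?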
Answer 0 (score: 2)
Here is an example.
Data:
import pandas as pd
df = pd.DataFrame({"val": [1,2,3,4,5], "key": ["Private 1st class", "Private First Class", "PFC", "other", "other"]})
Translation table:
translate = pd.DataFrame.from_records({"key": ["Private 1st class", "PFC", "Private First Class"],
"harmonizedkey": ["PFC", "PFC", "PFC"]})
Merge the translation table into df:
newdf = pd.merge(df, translate, how = "left", on = "key")
Create a new (complete) grouping column, falling back to the original key wherever no translation exists:
newdf["newgroup"] = newdf["harmonizedkey"].combine_first(newdf["key"])
newdf
                   key  val harmonizedkey newgroup
0    Private 1st class    1           PFC      PFC
1  Private First Class    2           PFC      PFC
2                  PFC    3           PFC      PFC
3                other    4           NaN    other
4                other    5           NaN    other
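The fallback behaviour of combine_first can be checked in isolation. This is a minimal sketch; the Series names harmonized and original are made up for illustration and stand in for the two columns above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for newdf["harmonizedkey"] and newdf["key"]
harmonized = pd.Series(["PFC", np.nan, np.nan])
original = pd.Series(["Private 1st class", "other", "other"])

# Values present in `harmonized` are kept; NaN slots are filled from `original`
filled = harmonized.combine_first(original)
print(filled.tolist())
```

Rows that have a translation keep it, and rows without one keep their original label, which is why "other" survives into newgroup.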
Now use groupby:
# Select "val" explicitly so the text columns are not aggregated as well
newdf.groupby("newgroup")[["val"]].sum()
          val
newgroup     
PFC         6
other       9
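If the synonyms already live in a plain dictionary, as in the question, the merge step may not be needed. The following is a sketch of an alternative using Series.replace, not part of the answer above; the name synonyms is an assumption, and the data is the same as in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    "val": [1, 2, 3, 4, 5],
    "key": ["Private 1st class", "Private First Class", "PFC", "other", "other"],
})

# Only the synonyms need entries; values that are not listed stay unchanged
synonyms = {"Private 1st class": "PFC", "Private First Class": "PFC"}
df["newgroup"] = df["key"].replace(synonyms)

# Aggregate and count on the harmonized column
totals = df.groupby("newgroup")["val"].sum()
counts = df["newgroup"].value_counts()
print(totals)
print(counts)
```

The harmonized column can then be passed to groupby, value_counts, or pd.crosstab in place of the raw one.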
Answer 1 (score: 0)
Use map and a dictionary to generate a new column:
import pandas as pd
df = pd.DataFrame([
('Private 1st Class', 3),
('Private First Class', 2),
('PFC', 5),
('Sergeant', 2),
('SGT', 2)
], columns = ['rank', 'bannanas'])
d = {
'Private 1st Class': 'PFC',
'Private First Class': 'PFC',
'PFC': 'PFC',
'Sergeant': 'SGT',
'SGT': 'SGT'
}
df['merged_rank'] = df['rank'].map(d)
print(df)
                  rank  bannanas merged_rank
0    Private 1st Class         3         PFC
1  Private First Class         2         PFC
2                  PFC         5         PFC
3             Sergeant         2         SGT
4                  SGT         2         SGT
print(df.groupby('merged_rank')['bannanas'].agg('sum'))
merged_rank
PFC    10
SGT     4
Name: bannanas, dtype: int64
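One caveat with map: any rank missing from the dictionary becomes NaN, and groupby drops NaN keys by default. A possible workaround, sketched here with a deliberately incomplete dictionary and a made-up "Corporal" row, is to fall back to the original value with fillna.

```python
import pandas as pd

df = pd.DataFrame(
    [("Private 1st Class", 3), ("PFC", 5), ("Sergeant", 2), ("Corporal", 1)],
    columns=["rank", "bannanas"],
)

# Hypothetical partial mapping: "Corporal" is deliberately missing
d = {"Private 1st Class": "PFC", "PFC": "PFC", "Sergeant": "SGT"}

mapped = df["rank"].map(d)                      # "Corporal" maps to NaN
df["merged_rank"] = mapped.fillna(df["rank"])   # fall back to the original rank

result = df.groupby("merged_rank")["bannanas"].sum()
print(result)
```

With the fallback in place, the dictionary only has to list ranks that actually need renaming.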