我有三个数据帧df_Male,df_female,Df_TransGender
样本数据框
df_Male
continent avg_count_country avg_age
Asia 55 5
Africa 65 10
Europe 75 8
df_Female
continent avg_count_country avg_age
Asia 50 7
Africa 60 12
Europe 70 0
df_Transgender
continent avg_count_country avg_age
Asia 30 6
Africa 40 11
America 80 10
现在我在下面这样连接
frames = [df_Male, df_Female, df_Transgender]
df = pd.concat(frames, keys=['Male', 'Female', 'Transgender'])
您可以看到America
中有df_transgender
,df_Male
和df_Female
中也有同样的欧洲
所以我必须以某种方式对其进行合并,以使其看起来像下面的,但不是手动的,因为行数可能很大
continent avg_count_country avg_age
Male 0 Asia 55 5
1 Africa 65 10
2 Europe 75 8
3 America 0 0
Female 0 Asia 50 7
1 Africa 60 12
2 Europe 70 0
3 America 0 0
Transgender 0 Asia 30 6
1 Africa 40 11
2 America 80 10
3 Europe 0 0
所以对于其他continent
值avg_count_country
和avg_age
应该为0
答案 0 :(得分:2)
您可以在连接前添加“性别”列。
我们将Categorical Data与groupby
一起使用来计算笛卡尔积。这也应该带来性能上的好处。
df = pd.concat([df_Male.assign(gender='Male'),
df_Female.assign(gender='Female'),
df_Transgender.assign(gender='Transgender')])
for col in ['gender', 'continent']:
df[col] = df[col].astype('category')
res = df.groupby(['gender', 'continent']).first().fillna(0).astype(int)
print(res)
avg_count_country avg_age
gender continent
Female Africa 60 12
America 0 0
Asia 50 7
Europe 70 0
Male Africa 65 10
America 0 0
Asia 55 5
Europe 75 8
Transgender Africa 40 11
America 80 10
Asia 30 6
Europe 0 0
答案 1 :(得分:1)
您可以重新索引一下。
from itertools import product
# Get rid of that number in the index, not sure why you'd need it
df.index = df.index.droplevel(-1)
# Add continents to the index
df = df.set_index('continent', append=True)
# Determine product of indices
ids = list(product(df.index.get_level_values(0).unique(), df.index.get_level_values(1).unique()))
# Reindex and fill missing with 0
df = df.reindex(ids).fillna(0).reset_index(level=-1)
df
现在是:
continent avg_count_country avg_age
Male Asia 55.0 5.0
Male Africa 65.0 10.0
Male Europe 75.0 8.0
Male America 0.0 0.0
Female Asia 50.0 7.0
Female Africa 60.0 12.0
Female Europe 70.0 0.0
Female America 0.0 0.0
Transgender Asia 30.0 6.0
Transgender Africa 40.0 11.0
Transgender Europe 0.0 0.0
Transgender America 80.0 10.0
如果您想要其他数字索引,则可以执行以下操作:
df.groupby(df.index).cumcount()
为每个组中的值编号。
答案 2 :(得分:1)
使用DataFrame.pivot
,对@jpp的答案稍作修改,就可以避免手动操作索引:
df = pd.concat([df_Male.assign(gender='Male'),
df_Female.assign(gender='Female'),
df_Transgender.assign(gender='Transgender')])
df.pivot('gender', 'continent').fillna(0).stack().astype(int)
avg_count_country avg_age
gender continent
Female Africa 60 12
America 0 0
Asia 50 7
Europe 70 0
Male Africa 65 10
America 0 0
Asia 55 5
Europe 75 8
Transgender Africa 40 11
America 80 10
Asia 30 6
Europe 0 0