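I have a DataFrame with several identifier columns. I want to create a new "group identifier" for each unique combination of identifiers; later, I want to run a regression using statsmodels. That is, I have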
id1 id2 id3
A 1 100
A 1 101
B 1 100
B 1 100
and I want
id1 id2 id3 groupid
A 1 100 0
A 1 101 1
B 1 100 2
B 1 100 2
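with id1, id2, id3 as the set of identifiers. I know I can use unique() to get the unique groups, but how do I efficiently encode each row with the unique group it belongs to?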
Adapting @Bernie's answer to handle potential NaNs:
import numpy as np
import pandas as pd

# get a DataFrame with just the unique "keys"
# (-1 stands in for NaN here; this assumes -1 is never a real value in df)
df2 = df.replace(np.nan, -1)
g = df2.groupby(['id1', 'id2', 'id3'])
gdf = pd.DataFrame(list(g.groups.keys()), columns=df.columns)
gdf = gdf.replace(-1, np.nan)
# an idea is to re-use the index as the 'group_id'
# the next three commands support that
# (DataFrame.sort was removed from pandas; sort_values replaces it)
gdf.sort_values(['id1', 'id2', 'id3'], inplace=True)
gdf.reset_index(drop=True, inplace=True)
gdf['group_id'] = gdf.index
# merge on the three id columns
mdf = df.merge(gdf, how='inner', on=df.columns.tolist())
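For what it's worth, the same sentinel idea can probably be written more compactly with groupby(...).ngroup(). This is only a sketch: the sample data with NaNs is made up for illustration, and it assumes -1 never appears as a real value in the id columns.

```python
import numpy as np
import pandas as pd

# hypothetical sample data with missing values in id3
df = pd.DataFrame({'id1': ['A', 'A', 'B', 'B', 'B'],
                   'id2': [1, 1, 1, 1, 1],
                   'id3': [100.0, np.nan, 100.0, np.nan, np.nan]})

cols = ['id1', 'id2', 'id3']
# replace NaN with a sentinel so rows with missing keys still form groups
# (assumption: -1 is not a legitimate value in any id column)
filled = df[cols].fillna(-1)
# ngroup() numbers each distinct key combination from 0 upward
df['group_id'] = filled.groupby(cols).ngroup()
```

Newer pandas versions (1.1 and later) also accept dropna=False in groupby, which may make the sentinel unnecessary; I have not relied on it here.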
Answer 0 (score: 1)
Is this what you are looking for?
import pandas as pd

df = pd.DataFrame({'id1': ['A', 'A', 'B', 'B'], 'id2': [1, 1, 1, 1], 'id3': [100, 101, 100, 100]})

def makegroup(x, y, z):
    # concatenate the three identifiers into a single string key
    return str(x) + str(y) + str(z)

df['groupid'] = df.apply(lambda row: makegroup(row['id1'], row['id2'], row['id3']), axis=1)

# map each distinct string key to a running integer, starting at 1
groupiddict = {}
groupincrimenter = 1
for x in df['groupid'].unique():
    groupiddict[x] = groupincrimenter
    groupincrimenter += 1
df['groupidINT'] = df.apply(lambda row: int(groupiddict[row['groupid']]), axis=1)
Here is the output:
id1 id2 id3 groupid groupidINT
0 A 1 100 A1100 1
1 A 1 101 A1101 2
2 B 1 100 B1100 3
3 B 1 100 B1100 3
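The dictionary loop above can likely be replaced by pd.factorize, which assigns integer codes in order of first appearance. A minimal sketch, not a definitive version: the '|' separator is my own addition (an assumption that '|' does not occur in the ids), meant to avoid two different combinations concatenating to the same string.

```python
import pandas as pd

df = pd.DataFrame({'id1': ['A', 'A', 'B', 'B'],
                   'id2': [1, 1, 1, 1],
                   'id3': [100, 101, 100, 100]})

# build one string key per row; the '|' separator is an assumption and
# keeps e.g. ('A1', 1) and ('A', 11) from producing the same key
key = df['id1'].astype(str) + '|' + df['id2'].astype(str) + '|' + df['id3'].astype(str)

# factorize returns integer codes (starting at 0) and the unique keys
codes, uniques = pd.factorize(key)
df['groupidINT'] = codes + 1  # shift so numbering starts at 1, as above
```

This avoids the row-wise apply calls, which tend to be slow on large frames.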
Answer 1 (score: 1)
There are surely countless solutions. Here is what I came up with...
>>> df
id1 id2 id3
0 A 1 100
1 A 1 101
2 B 1 100
3 B 1 100
import pandas as pd

# get a DataFrame with just the unique "keys"
g = df.groupby(['id1', 'id2', 'id3'])
gdf = pd.DataFrame(list(g.groups.keys()), columns=df.columns)
# an idea is to re-use the index as the 'group_id'
# the next three commands support that
# (DataFrame.sort was removed from pandas; sort_values replaces it)
gdf.sort_values(['id1', 'id2', 'id3'], inplace=True)
gdf.reset_index(drop=True, inplace=True)
gdf['group_id'] = gdf.index
# merge on the three id columns
mdf = df.merge(gdf, how='inner', on=df.columns.tolist())
Yields:
  id1  id2  id3  group_id
0   A    1  100         0
1   A    1  101         1
2   B    1  100         2
3   B    1  100         2
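In more recent pandas versions, the group-then-merge steps above can probably be collapsed into a single call to groupby(...).ngroup(). A minimal sketch on the same sample data, assuming a pandas version that provides ngroup:

```python
import pandas as pd

df = pd.DataFrame({'id1': ['A', 'A', 'B', 'B'],
                   'id2': [1, 1, 1, 1],
                   'id3': [100, 101, 100, 100]})

# ngroup() numbers the groups from 0; with the default sort=True the
# numbers follow the sorted order of the key combinations
df['group_id'] = df.groupby(['id1', 'id2', 'id3']).ngroup()
```

The numbering matches the sorted-index approach above, since both order the unique key combinations before assigning ids.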