Creating group identifiers for a regression

Time: 2014-11-06 22:37:43

Tags: python pandas

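I have a DataFrame with several identifier columns. I want to create a new "group identifier" for each unique combination of identifiers; later, I want to run a regression with statsmodels. That is, I have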

  id1 id2 id3 
    A   1 100
    A   1 101
    B   1 100
    B   1 100

and I want

  id1 id2 id3 groupid 
    A   1 100       0
    A   1 101       1
    B   1 100       2
    B   1 100       2

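with id1, id2 and id3 as the set of identifiers. I know I can use unique() to get the unique groups, but how do I efficiently encode each row with the unique group it belongs to?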

Adapting @Bernie's answer to handle potential NaNs:

import numpy as np
import pandas as pd

# get a DataFrame with just the unique "keys"
# (-1 stands in for NaN here; this assumes -1 is never a real id value)
df2 = df.replace(np.nan, -1)
g = df2.groupby([u'id1',u'id2',u'id3'])
gdf = pd.DataFrame(list(g.groups.keys()),columns=df.columns)
gdf = gdf.replace(-1, np.nan)
# an idea is to re-use the index as the 'group_id'
# the next three commands support that
# (sort_values replaces DataFrame.sort, which newer pandas no longer has)
gdf.sort_values([u'id1',u'id2',u'id3'],inplace=True)
gdf.reset_index(drop=True,inplace=True)
gdf['group_id'] = gdf.index

# merge on the three id columns
mdf = df.merge(gdf,how='inner',on=df.columns.tolist())
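
A possible way to avoid the -1 placeholder altogether, as a rough sketch: build the table of unique keys with drop_duplicates(), which treats NaN values as equal to each other, and merge it back, since pandas merge also matches NaN keys against each other. The frame `df_nan` and the list `key_cols` are made up for illustration and are not from the question; here the ids follow order of first appearance rather than sorted order.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with missing values in id3
df_nan = pd.DataFrame({'id1': ['A', 'A', 'B', 'B', 'A'],
                       'id2': [1, 1, 1, 1, 1],
                       'id3': [100, np.nan, 100, 100, np.nan]})
key_cols = ['id1', 'id2', 'id3']

# One row per unique key combination; NaN counts as equal to NaN here
keys_df = df_nan[key_cols].drop_duplicates().reset_index(drop=True)

# Re-use the fresh 0..n-1 index as the group id
keys_df['group_id'] = keys_df.index

# A left merge keeps every original row in its original order
mdf_nan = df_nan.merge(keys_df, how='left', on=key_cols)
print(mdf_nan)
```

Because no placeholder is written into the data, this does not depend on -1 being unused as a real id.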

2 answers:

Answer 0 (score: 1)

Is this what you are looking for?

import pandas as pd

df = pd.DataFrame({'id1': ['A','A','B','B'],'id2':[1,1,1,1],'id3':[100,101,100,100]})

def makegroup(x,y,z):
    return str(x) + str(y) + str(z)

df['groupid'] = df.apply(lambda row: makegroup(row['id1'], row['id2'], row['id3']), axis=1)

groupiddict = {}
groupincrimenter = 1

for x in df['groupid'].unique():
    groupiddict[x] = groupincrimenter
    groupincrimenter += 1

df['groupidINT'] = df.apply(lambda row: int(groupiddict[row['groupid']]), axis=1)

Here is the output:

  id1  id2  id3 groupid  groupidINT
0   A    1  100   A1100           1
1   A    1  101   A1101           2
2   B    1  100   B1100           3
3   B    1  100   B1100           3
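
One caveat with concatenating the ids as strings: different combinations can collapse into the same key, for example (1, 10) and (11, 0) both become "110". A minimal sketch of a variant that should avoid this is to key on tuples and let pd.factorize assign the integers in order of first appearance. The frame `df_demo` and the other names below are invented for illustration, not part of the answer above.

```python
import pandas as pd

# Hypothetical data where plain string concatenation collides
df_demo = pd.DataFrame({'id1': ['A', 'A', 'B', 'B'],
                        'id2': [1, 11, 1, 1],
                        'id3': [10, 0, 100, 100]})
key_cols = ['id1', 'id2', 'id3']

# The concatenated keys for rows 0 and 1 are both 'A110'
concat_key = (df_demo['id1'].astype(str)
              + df_demo['id2'].astype(str)
              + df_demo['id3'].astype(str))

# Tuples keep the components separate, so distinct combinations stay distinct
tuple_key = pd.Series(list(df_demo[key_cols].itertuples(index=False, name=None)),
                      index=df_demo.index)

# factorize returns integer codes numbered by first appearance
codes, uniques = pd.factorize(tuple_key)
df_demo['groupidINT'] = codes
print(df_demo)
```

The codes start at 0; add 1 if the numbering should start at 1 as in the output above.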

Answer 1 (score: 1)

There are surely countless solutions. Here is what I arrived at...

>>> df
  id1  id2  id3
0   A    1  100
1   A    1  101
2   B    1  100
3   B    1  100

# get a DataFrame with just the unique "keys"
g = df.groupby([u'id1',u'id2',u'id3'])
gdf = pd.DataFrame(list(g.groups.keys()),columns=df.columns)

# an idea is to re-use the index as the 'group_id'
# the next three commands support that 
gdf.sort_values([u'id1',u'id2',u'id3'],inplace=True)
gdf.reset_index(drop=True,inplace=True)
gdf['group_id'] = gdf.index

# merge on the three id columns
mdf = df.merge(gdf,how='inner',on=df.columns.tolist())

Yields:

  id1  id2  id3  group_id
0   A    1  100         0
1   A    1  101         1
2   B    1  100         2
3   B    1  100         2
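
For reference, the same sorted numbering can probably be had in one step with groupby(...).ngroup(), available in more recent pandas versions. This is a sketch of an alternative, not what the answer above does; it rebuilds the example frame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'id1': ['A', 'A', 'B', 'B'],
                   'id2': [1, 1, 1, 1],
                   'id3': [100, 101, 100, 100]})

# ngroup() numbers the groups 0..n_groups-1 in the order the groupby
# iterates over them, which is sorted key order by default
df['group_id'] = df.groupby(['id1', 'id2', 'id3']).ngroup()
print(df)
```

For the regression step, one option would be to treat the new column as categorical in a statsmodels formula, for example with a C(group_id) term, so that the integer codes are not read as a numeric variable.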