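I have the following data in a pandas DataFrame named df. I am trying to turn the code values into columns for each ADM_ID, with each column named after the position of the code.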
ID ADM_ID code
108 183350 7100
108 183350 5849
108 183350 5780
108 183350 99811
108 183350 4466
108 183350 40301
108 183350 58281
108 183350 E8798
108 183350 58889
108 183350 4430
108 183350 78659
109 128755 4372
109 128755 78039
109 128755 7100
109 128755 40391
109 128755 4251
109 128755 2859
109 164029 40301
109 164029 7100
109 164029 5856
109 164029 V4983
109 164029 58381
109 164029 3643
109 108375 7100
109 108375 40301
109 108375 5856
109 108375 58381
109 108375 3643
109 108375 28521
109 193281 40301
109 193281 5856
109 193281 7100
109 193281 7907
109 193281 4254
109 193281 99662
109 193281 99812
109 193281 36001
109 193281 11289
109 193281 V5865
109 193281 7821
109 193281 28521
109 193281 37900
109 193281 37632
109 193281 37005
109 193281 36400
I would like to convert it to the following:
ID ADM_ID cnt code1 code2 code3 code4 code5 code6 code7 code8 code9 code10 code11 code12 code13 code14 code15 code16
108 183350 11 7100 5849 5780 99811 4466 40301 58281 E8798 58889 4430 78659
109 128755 6 4372 78039 7100 40391 4251 2859
109 164029 6 40301 7100 5856 V4983 58381 3643
109 108375 6 7100 40301 5856 58381 3643 28521
109 193281 16 40301 5856 7100 7907 4254 99662 99812 36001 11289 V5865 7821 28521 37900 37632 37005 36400
I cannot guarantee that there will be only 16 codes; each ID has an arbitrary number of codes. Can someone help me with this?
Thanks,
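Answer 0 (score: 0)
On what basis should values be assigned to the columns? From the data you posted, it is not clear how each value ends up in a particular new column.
The .pivot() function looks like a good fit here. However, you will need to add a key, or use .groupby(), to reshape the DataFrame.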
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html
https://pandas.pydata.org/pandas-docs/stable/reshaping.html
Update
First number the codes within each group:
df['pos'] = df.groupby('ADM_ID').cumcount() + 1
Then pivot on that position:
new_df = df.pivot(index='ADM_ID', columns='pos', values='code')
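To make the group-then-pivot idea concrete, here is a minimal sketch on a reduced sample of the question's data. It assumes the codes are stored as strings and that no code is missing; the names keys, pos and wide are only illustrative, and unstack is used in place of pivot so that both ID and ADM_ID can stay in the index.

```python
import pandas as pd

# Reduced sample of the question's data (codes kept as strings).
df = pd.DataFrame({
    "ID":     [108, 108, 108, 109, 109],
    "ADM_ID": [183350, 183350, 183350, 128755, 128755],
    "code":   ["7100", "5849", "E8798", "4372", "78039"],
})

keys = ["ID", "ADM_ID"]

# Number each code within its (ID, ADM_ID) group: 1, 2, 3, ...
# then move that position into the index and unstack it into columns.
wide = (df.assign(pos=df.groupby(keys).cumcount() + 1)
          .set_index(keys + ["pos"])["code"]
          .unstack()
          .add_prefix("code"))
wide.columns.name = None

# Count the codes actually present in each row (assumes no missing codes in df).
wide.insert(0, "cnt", wide.notna().sum(axis=1))
wide = wide.reset_index()
print(wide)
```

Groups with fewer codes than the longest one end up with NaN in the trailing columns, so the number of code columns follows the data rather than being fixed at 16.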
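Answer 1 (score: 0)
You can use:
- groupby with apply(list) to collect the values of the code column into lists
- str.len to get the length of each list as the Series l
- the DataFrame constructor to build a new frame from the values of df1
- add_prefix for the new column names, and reset_index
- insert to add the new column cnt as the 3rd column (position 2, because Python counts from 0)

import pandas as pd

df1 = df.groupby(['ID','ADM_ID'])['code'].apply(list)
l = df1.str.len()
df = pd.DataFrame(df1.values.tolist(),
                  index=df1.index,
                  columns=range(1, l.max() + 1)) \
       .add_prefix('code') \
       .reset_index()
df.insert(2, 'cnt', l.values)
print(df)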
ID ADM_ID cnt code1 code2 code3 code4 code5 code6 code7 code8 \
0 108 183350 11 7100 5849 5780 99811 4466 40301 58281 E8798
1 109 108375 6 7100 40301 5856 58381 3643 28521 None None
2 109 128755 6 4372 78039 7100 40391 4251 2859 None None
3 109 164029 6 40301 7100 5856 V4983 58381 3643 None None
4 109 193281 16 40301 5856 7100 7907 4254 99662 99812 36001
code9 code10 code11 code12 code13 code14 code15 code16
0 58889 4430 78659 None None None None None
1 None None None None None None None None
2 None None None None None None None None
3 None None None None None None None None
4 11289 V5865 7821 28521 37900 37632 37005 36400
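The same steps can be wrapped in a small function so the number of code columns is derived from the data. This is only a sketch: codes_to_wide and its parameters are hypothetical names, not part of pandas, and the sample frame is a cut-down version of the question's data.

```python
import pandas as pd

def codes_to_wide(df, keys=("ID", "ADM_ID"), value="code"):
    """Hypothetical helper: one row per key group, one column per code position."""
    keys = list(keys)
    lists = df.groupby(keys)[value].apply(list)   # Series holding one list per group
    lengths = lists.str.len()                     # number of codes in each group
    wide = pd.DataFrame(lists.values.tolist(),    # shorter lists are padded with None
                        index=lists.index,
                        columns=range(1, lengths.max() + 1))
    wide = wide.add_prefix(value).reset_index()
    wide.insert(len(keys), "cnt", lengths.values) # place cnt right after the key columns
    return wide

sample = pd.DataFrame({
    "ID":     [108, 108, 108, 109],
    "ADM_ID": [183350, 183350, 183350, 128755],
    "code":   ["7100", "E8798", "4430", "4372"],
})
result = codes_to_wide(sample)
print(result)
```

Because the column range comes from lengths.max(), a group with 20 codes would simply produce code1 through code20 without any change to the function.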
Answer 2 (score: 0)
Using groupby:
import numpy as np
import pandas as pd

# Per group: count the codes and collect them into a tuple.
df2 = df.groupby(['ID', 'ADM_ID'])['code'].agg([np.count_nonzero,
                                                lambda x: tuple(x)])
# Expand the tuples into one column per position.
df3 = pd.concat([df2.reset_index(),
                 pd.DataFrame(df2['<lambda>'].tolist())],
                axis=1)
del df3['<lambda>']
cols = ['ID', 'ADM_ID', 'cnt']
cols.extend(['code'+str(i) for i in range(1, len(df3.columns)-2)])
df3.columns = cols
df3
Out[52]:
ID ADM_ID cnt code1 code2 code3 code4 code5 code6 code7 code8 \
0 108 183350 11 7100 5849 5780 99811 4466 40301 58281 E8798
1 109 108375 6 7100 40301 5856 58381 3643 28521 None None
2 109 128755 6 4372 78039 7100 40391 4251 2859 None None
3 109 164029 6 40301 7100 5856 V4983 58381 3643 None None
4 109 193281 16 40301 5856 7100 7907 4254 99662 99812 36001
code9 code10 code11 code12 code13 code14 code15 code16
0 58889 4430 78659 None None None None None
1 None None None None None None None None
2 None None None None None None None None
3 None None None None None None None None
4 11289 V5865 7821 28521 37900 37632 37005 36400
If you can get fully numeric codes (I only mention this because you look so close to that), you could use pivot and np.sort instead:
df2 = df.pivot(index='ADM_ID', columns='code', values='code')
df2.values.sort()
df2.dropna(how='all', axis=1, inplace=True)
df2.columns = ['code'+str(i) for i in range(1, len(df2.columns)+1)]
df2.insert(0, 'cnt', df2.count(axis=1))
df2
Out[71]:
cnt code1 code2 code3 code4 code5 code6 code7 \
ADM_ID
108375 6 3643.0 5856.0 7100.0 28521.0 40301.0 58381.0 NaN
128755 6 2859.0 4251.0 4372.0 7100.0 40391.0 78039.0 NaN
164029 6 3643.0 4983.0 5856.0 7100.0 40301.0 58381.0 NaN
183350 11 4430.0 4466.0 5780.0 5849.0 7100.0 8798.0 40301.0
193281 16 4254.0 5856.0 5865.0 7100.0 7821.0 7907.0 11289.0
code8 code9 code10 code11 code12 code13 code14 \
ADM_ID
108375 NaN NaN NaN NaN NaN NaN NaN
128755 NaN NaN NaN NaN NaN NaN NaN
164029 NaN NaN NaN NaN NaN NaN NaN
183350 58281.0 58889.0 78659.0 99811.0 NaN NaN NaN
193281 28521.0 36001.0 36400.0 37005.0 37632.0 37900.0 40301.0
code15 code16
ADM_ID
108375 NaN NaN
128755 NaN NaN
164029 NaN NaN
183350 NaN NaN
193281 99662.0 99812.0
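With the E and V prefixes stripped, this is somewhat faster on this data (2.63 ms vs 4.84 ms), and about 12 times faster (5.92 ms vs 74.2 ms) when tested on a DataFrame with the same number of adm_ids and 100 times as many codes.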
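Unfortunately, the numpy array sort does not seem to play nicely with NaN in arrays of strings, and every workaround I have found seems to be more expensive than groupby.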
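A small check of that limitation, as a sketch: on Python 3 I would expect sorting an object array that mixes strings with NaN to raise a TypeError, because a str and a float cannot be compared, while a purely numeric array sorts fine with NaN placed at the end. The array contents below are made up from codes in the question.

```python
import numpy as np

# Object array mixing string codes with NaN, as pivot would leave behind
# when the E/V codes are kept as strings.
mixed = np.array(["7100", np.nan, "E8798"], dtype=object)

try:
    np.sort(mixed)
    sort_failed = False
except TypeError:
    # str and float cannot be ordered against each other on Python 3.
    sort_failed = True

# With purely numeric codes the sort works, and NaN is placed last.
numeric = np.array([7100.0, np.nan, 4430.0])
sorted_numeric = np.sort(numeric)

print(sort_failed)
print(sorted_numeric)
```

That is why the pivot approach above only applies once the E and V prefixes have been stripped and the codes are numeric.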
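Answer 3 (score: 0)
Use method chaining to do this in one go. First group by 'ID' and 'ADM_ID', then turn each group's count and values into a list, next convert the lists into columns, add a prefix, rename the count column, and finally reset the index.
This solution automatically handles more or fewer columns.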
(df.groupby(['ID','ADM_ID'])
   .apply(lambda x: [len(x)]+x.code.tolist())
   .apply(pd.Series)
   .add_prefix('code')
   .rename(columns={'code0':'cnt'})
   .reset_index()
)
Out[389]:
ID ADM_ID cnt code1 code2 code3 code4 code5 code6 code7 code8 \
0 108 183350 11 7100 5849 5780 99811 4466 40301 58281 E8798
1 109 108375 6 7100 40301 5856 58381 3643 28521 NaN NaN
2 109 128755 6 4372 78039 7100 40391 4251 2859 NaN NaN
3 109 164029 6 40301 7100 5856 V4983 58381 3643 NaN NaN
4 109 193281 16 40301 5856 7100 7907 4254 99662 99812 36001
code9 code10 code11 code12 code13 code14 code15 code16
0 58889 4430 78659 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 11289 V5865 7821 28521 37900 37632 37005 36400