Pandas: how to conditionally split a column based on another column?

Time: 2018-09-19 12:53:05

Tags: python pandas dataframe split

Below is my pandas DataFrame:

Id        IsDef     Data                                       
1         Y         1a
2         N,N,N,Y   2a,2b,2c,2d
3         N,Y       3a,3b

How can I split it as shown below using Pandas? Only the first two entries for each of Y (yes) and N (no) should be kept.

Id        DataY_1   DataY_2   DataN_1  DataN_2                                     
1         1a        NULL      NULL     NULL   
2         2d        NULL      2a       2b
3         3b        NULL      3a       NULL

1 answer:

Answer 0: (score: 2)

You can flatten the columns into a new DataFrame:

from itertools import chain

import numpy as np
import pandas as pd

d = df['Data'].str.split(',')
isdef = df['IsDef'].str.split(',')

df = pd.DataFrame({
    'Data' : list(chain.from_iterable(d)), 
    'IsDef' : list(chain.from_iterable(isdef)), 
    'Id' : df['Id'].repeat(d.str.len())
})
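
As an aside, the same flattening could probably also be written with DataFrame.explode. This is a swapped-in technique, not the one used in this answer, and it assumes pandas 1.3 or later, where explode accepts a list of columns. The names raw and flat and the sample data (rebuilt from the question) are only for illustration:

```python
import pandas as pd

# Sample data rebuilt from the question.
raw = pd.DataFrame({
    'Id': [1, 2, 3],
    'IsDef': ['Y', 'N,N,N,Y', 'N,Y'],
    'Data': ['1a', '2a,2b,2c,2d', '3a,3b'],
})

# Split both string columns into lists, then explode them together so that
# the i-th flag stays paired with the i-th data value.
# Assumption: pandas >= 1.3 (multi-column explode).
flat = raw.assign(
    IsDef=raw['IsDef'].str.split(','),
    Data=raw['Data'].str.split(','),
).explode(['IsDef', 'Data'], ignore_index=True)

print(flat)
```

The result has the same three columns (Id, IsDef, Data) with one row per entry, so the steps that follow should apply to it unchanged.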

Then use cumcount to build a counter within each group, and use boolean indexing to keep only the first 2 rows of each group:

N = 2
df['g'] = df.groupby(['Id','IsDef']).cumcount()
df = df[df['g'] < N]
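
To see what the counter looks like, here is a small sketch on hand-written rows for Id 2 and 3 from the question. The names flat, counter and kept are only for illustration:

```python
import pandas as pd

# Already-flattened rows for Id 2 and Id 3 from the question.
flat = pd.DataFrame({
    'Id':    [2, 2, 2, 2, 3, 3],
    'IsDef': ['N', 'N', 'N', 'Y', 'N', 'Y'],
    'Data':  ['2a', '2b', '2c', '2d', '3a', '3b'],
})

# cumcount numbers the rows inside each (Id, IsDef) group, starting at 0.
counter = flat.groupby(['Id', 'IsDef']).cumcount()

# Keep only rows whose position within the group is 0 or 1.
kept = flat[counter < 2]

print(counter.tolist())       # [0, 1, 2, 0, 0, 0]
print(kept['Data'].tolist())  # ['2a', '2b', '2d', '3a', '3b']
```

Only 2c is dropped here, because it is the third N entry for Id 2.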

Then reshape with set_index and unstack, and add the missing categories with reindex. Finally, flatten the MultiIndex columns with an f-string:

mux = pd.MultiIndex.from_product([['Y','N'], np.arange(N)])
df = df.set_index(['Id','IsDef', 'g'])['Data'].unstack([1,2]).reindex(columns=mux)
df.columns = [f'Data{i}_{j+1}' for i, j in df.columns]
print(df)
   DataY_1  DataY_2 DataN_1 DataN_2
Id                                 
1       1a      NaN     NaN     NaN
2       2d      NaN      2a      2b
3       3b      NaN      3a     NaN
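
For convenience, the three steps above can be gathered into one function. This is only a sketch under stated assumptions: the function name split_by_flag and its parameters n and flags are hypothetical, the column names Id, IsDef and Data are taken from the question, and np.repeat is used for the Id column so the flattened frame gets a plain default index:

```python
from itertools import chain

import numpy as np
import pandas as pd


def split_by_flag(frame, n=2, flags=('Y', 'N')):
    """Hypothetical helper: keep the first n Data values per Id and flag."""
    data = frame['Data'].str.split(',')
    isdef = frame['IsDef'].str.split(',')

    # Step 1: flatten to one row per (Id, flag, value).
    flat = pd.DataFrame({
        'Id': np.repeat(frame['Id'].to_numpy(), data.str.len().to_numpy()),
        'IsDef': list(chain.from_iterable(isdef)),
        'Data': list(chain.from_iterable(data)),
    })

    # Step 2: number rows within each (Id, IsDef) group, keep the first n.
    flat['g'] = flat.groupby(['Id', 'IsDef']).cumcount()
    flat = flat[flat['g'] < n]

    # Step 3: reshape to wide form and add any missing flag/position columns.
    mux = pd.MultiIndex.from_product([list(flags), np.arange(n)])
    wide = (flat.set_index(['Id', 'IsDef', 'g'])['Data']
                .unstack([1, 2])
                .reindex(columns=mux))
    wide.columns = [f'Data{flag}_{pos + 1}' for flag, pos in wide.columns]
    return wide


sample = pd.DataFrame({
    'Id': [1, 2, 3],
    'IsDef': ['Y', 'N,N,N,Y', 'N,Y'],
    'Data': ['1a', '2a,2b,2c,2d', '3a,3b'],
})
result = split_by_flag(sample)
print(result)
```

Calling split_by_flag(sample, n=3) would, under the same assumptions, also keep 2c in a DataN_3 column.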