跟着我上一个问题-Creating new columns based on value from another column in pandas
我现在的目标是:
Code Name Level1 Level1Name Level2 Level2Name Level3 Level3Name
0 A USA A USA
1 AM Massachusetts A USA AM Massachusetts
2 AMB Boston A USA AM Massachusetts AMB Boston
3 AMS Springfield A USA AM Massachusetts AMS Springfiled
4 D Germany D Germany
5 DB Brandenburg D Germany DB Brandenburg
6 DBB Berlin D Germany DB Brandenburg DBB Berlin
7 DBD Dresden D Germany DB Brandenburg DBD Dresden
到目前为止,我基于Scott Boston的代码:
match 0 1 2
0 A A A
1 A AM AM
2 A AM AMB
3 A AM AMS
4 D D D
5 D DB DB
6 D DB DBB
7 D DB DBD
我的方法是循环遍历每一列,并删除与该列中其余值长度不同但似乎无法弄清楚逻辑的行。
示例代码:
df = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx')
df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match').fillna('').cumsum(axis=1)
df_level
谢谢您的帮助!
答案 0 :(得分:1)
让我们尝试一下:
df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match', fill_value='')
df_level = df_level.cumsum(axis=1).mask(df_level == '')
s_map = df.explode('Codes').drop_duplicates('Code', keep='last').set_index('Code')['Name']
df_level.columns = [f'Level{i+1}' for i in df_level.columns]
df_level_names = pd.concat([df_level[i].map(s_map) for i in df_level.columns],
axis=1,
keys=df_level.columns+'Name')
df_out = df.join([df_level, df_level_names]).drop('Codes', axis=1)
df_out
输出:
Code Name Level1 Level2 Level3 Level1Name Level2Name Level3Name
0 A USA A NaN NaN USA NaN NaN
1 AM Massachusetts A AM NaN USA Massachusetts NaN
2 AMB Boston A AM AMB USA Massachusetts Boston
3 AMS Springfield A AM AMS USA Massachusetts Springfield
4 D Germany D NaN NaN Germany NaN NaN
5 DB Brandenburg D DB NaN Germany Brandenburg NaN
6 DBB Berlin D DB DBB Germany Brandenburg Berlin
7 DBD Dresden D DB DBD Germany Brandenburg Dresden
答案 1 :(得分:0)
此方法使用apply
和功能:
import pandas as pd
l = ['A', 'AM', 'AMB', 'AMS', 'D', 'DB', 'DBB', 'DBD']
df = pd.DataFrame(l).rename(columns={0:'code'})
def level2(col):
if len(col) == 1:
return ''
elif len(col) >= 2:
return col[:2]
def level3(col):
if len(col) <= 2:
return ''
elif len(col) > 2:
return col[:3]
df['Level1'] = df['code'].apply(lambda col: col[0])
df['Level2'] = df['code'].apply(level2)
df['Level3'] = df['code'].apply(level3)
print(df)
输出:
code Level1 Level2 Level3
0 A A
1 AM A AM
2 AMB A AM AMB
3 AMS A AM AMS
4 D D
5 DB D DB
6 DBB D DB DBB
7 DBD D DB DBD
这些功能也可以重构为一个功能,但是您的要点是。我建议将apply
用于其他熊猫方法,因为apply
更容易记住和自定义。希望这会有所帮助。
答案 2 :(得分:0)
我采用了另一种方法:假设您没有太多的层次,请遍历代码的长度。
import pandas as pd
df=pd.DataFrame({
'Code':['A','AM','AMB'],
'Name':['USA','Massachusetts',"Boston"]
})
# prepare
res=pd.DataFrame({
'Code':[]
})
df['len']=df['Code'].str.len()
cols=[]
for x in range(df['len'].max()):
dfX=df[df['len']==x+1].copy()
dfX['prefix']=dfX['Code'].str.slice(stop=x)
dfX=dfX.merge(res,how='left',left_on='prefix',right_on='Code')
dfX[f'Level{x+1}']=dfX['Code_x']
dfX[f'Level{x+1}Name']=dfX['Name']
dfX[f'Code']=dfX['Code_x']
cols+=[f'Level{x+1}',f'Level{x+1}Name']
res=res.append(dfX[['Code']+cols],sort=False)
res
Code Level1 Level1Name Level2 Level2Name Level3 Level3Name
0 A A USA NaN NaN NaN NaN
0 AM A USA AM Massachusetts NaN NaN
0 AMB A USA AM Massachusetts AMB Boston
首先将1级添加到查找表中;然后2和3级... 代码看起来很丑陋,但希望易于理解。