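I have a DataFrame with two columns. I want to create a third column that is the "sum" (string concatenation) of the first two, but without the leading prefix of each value. I think it is easiest to show with an example: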
col1 col2 col3 (need to make)
abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
psdb_what_I_want2 what_I_want2
vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
ertsa_what_I_want5 what_I_want5
abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
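Note that what_I_want# differs from row to row but is identical across the columns within a row. The prefix is always the same within a row, but it can differ or repeat between rows. Cells shown as blank are "" (empty) strings.
My code so far: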
df["col3"] = df["col1"].str.split("_", 1) + df["col2"].str.split("_", 1)
From there, I only need the second (or last) element of each split, so I tried the following two things:
df["col3"] = df["col1"].str.split("_", 1)[1] + df["col2"].str.split("_", 1)[1]
df["col3"] = df["col1"].str.split("_", 1)[-1] + df["col2"].str.split("_", 1)[-1]
Both of these raise errors. I think the first error is caused by duplicate values (ValueError: cannot reindex from a duplicate axis). The second is a KeyError.
Answer 0 (score: 2)
You were actually very close. You just need to pick the right element of each split with .str[1], and use fillna to handle the blank cells:
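# Keep the part after the first "_" in each column; "" cells yield NaN, so fill those with "".
# n is passed by keyword because recent pandas versions no longer accept it positionally.
m = df['col1'].str.split('_', n=1).str[1].fillna('') + df['col2'].str.split('_', n=1).str[1].fillna('')
df['col3'] = m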
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
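To make the difference between [1] and .str[1] concrete, here is a minimal sketch. The three-row DataFrame is hypothetical sample data modelled on the question (blank cells are "" strings), and the names parts, row_one and second are only illustrative. It assumes the default integer index, where [1] is a label lookup on the Series rather than an index into each list.

```python
import pandas as pd

# Hypothetical sample data modelled on the question; blank cells are "" strings.
df = pd.DataFrame({
    "col1": ["abc_what_I_want1", "psdb_what_I_want2", ""],
    "col2": ["abc_what_I_want1", "", "ertsa_what_I_want3"],
})

parts = df["col1"].str.split("_", n=1)

# parts[1] looks up the label 1 in the Series index, so it returns the
# whole list stored in that row, not the second piece of every row.
row_one = parts[1]

# parts.str[1] indexes into every list element-wise. Rows whose list has no
# second element (the "" cells) become NaN, which is why fillna("") is needed.
second = parts.str[1].fillna("")

df["col3"] = second + df["col2"].str.split("_", n=1).str[1].fillna("")
print(df)
```

With a default integer index, parts[-1] fails for the same reason: -1 is treated as a label, and no row carries that label.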
Another approach is to use apply, which lets you run the split on several columns at once:
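# Split every selected column on the first "_" and keep the second piece; missing pieces become "".
m = df[['col1', 'col2']].apply(lambda x: x.str.split('_', n=1).str[1]).fillna('')
df['col3'] = m['col1'] + m['col2']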
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
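If more than two columns need the same treatment, the apply version can be wrapped in a small function. This is only a sketch: concat_without_prefix, its parameters and the sample data are hypothetical names introduced for illustration, not something from the question.

```python
import pandas as pd

def concat_without_prefix(frame, columns, sep="_"):
    """Hypothetical helper: drop everything up to the first `sep` in each
    of `columns`, then concatenate the remainders row by row."""
    stripped = frame[columns].apply(
        lambda s: s.str.split(sep, n=1).str[1]
    ).fillna("")
    # Join the cleaned values of each row into a single string.
    return stripped.apply("".join, axis=1)

# Hypothetical sample data; blank cells are "" strings as in the question.
df = pd.DataFrame({
    "col1": ["abc_what_I_want1", "psdb_what_I_want2", ""],
    "col2": ["abc_what_I_want1", "", "ertsa_what_I_want3"],
})
df["col3"] = concat_without_prefix(df, ["col1", "col2"])
print(df)
```

Passing the column list explicitly keeps the result independent of any other columns the DataFrame may have.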
Answer 1 (score: 2)
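You can use replace() with a regular expression to strip everything up to and including the first underscore, and then combine the columns row by row, either with apply() and join() or with sum(), over axis=1: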
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').apply(''.join,axis=1)
Or:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').sum(axis=1)
Or:
df['Col3']=(pd.Series(df.replace('^[^_]*_','',regex=True).fillna('').values.tolist())
.str.join(''))
col1 col2 Col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2 what_I_want2I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 NaN ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
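A self-contained sketch of the regex variant follows. The sample data is hypothetical, and limiting the replacement to an explicit source_cols list is an assumption added here so that rerunning the line after Col3 exists does not fold Col3 back into the result. The NaN cell mirrors row 4 of the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one NaN cell and one "" cell.
df = pd.DataFrame({
    "col1": ["abc_what_I_want1", "psdb_what_I_want2", np.nan],
    "col2": ["abc_what_I_want1", "", "ertsa_what_I_want3"],
})

source_cols = ["col1", "col2"]  # assumption: only these columns are combined

# ^[^_]*_ matches from the start of the string through the first underscore.
stripped = df[source_cols].replace(r"^[^_]*_", "", regex=True).fillna("")
df["Col3"] = stripped.apply("".join, axis=1)
print(df)
```

The pattern removes everything through the first underscore whether or not it is really a prefix. That seems to be what happened in row 1 of the output above, where col2 holds what_I_want2 with no prefix and is cut down to I_want2.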