Create a column from the second part of the text in two pandas columns

Time: 2019-06-28 18:38:47

Tags: python pandas

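I have a DataFrame with two columns. I want to create a third column that is the "sum" (string concatenation) of the first two, but without the leading prefix of each value. It is probably easiest to show with an example: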

col1                col2                 col3 (need to make)
abc_what_I_want1    abc_what_I_want1     what_I_want1what_I_want1
psdb_what_I_want2                        what_I_want2
vxc_what_I_want3    vxc_what_I_want3     what_I_want3what_I_want3
qk_what_I_want4     qk_what_I_want4      what_I_want4what_I_want4
                    ertsa_what_I_want5   what_I_want5
abc_what_I_want6    abc_what_I_want6     what_I_want6what_I_want6

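Note that what_I_want# differs from row to row but is identical across the columns within a row. Within a row the prefix is always the same, but prefixes can differ or repeat between rows. Cells shown as blank are "" strings.

My code so far: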

 df["col3"] = df["col1"].str.split("_", 1) + df["col2"].str.split("_", 1)

From there I only need the second (or last) element of each split, so I tried the following two statements:

 df["col3"] = df["col1"].str.split("_", 1)[1] + df["col2"].str.split("_", 1)[1]
 df["col3"] = df["col1"].str.split("_", 1)[-1] + df["col2"].str.split("_", 1)[-1]

Both of these raise errors. I think the first error is caused by duplicated values (ValueError: cannot reindex from a duplicate axis). The second is a KeyError.
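A minimal sketch of why the plain `[1]` and `[-1]` lookups misbehave, assuming a small Series with the default integer index (the sample values and variable names here are illustrative, not the asker's real data). Plain brackets on the result of `str.split` look up a row by index label, whereas `.str[1]` indexes into each row's list. If the real index contained duplicate labels, `[1]` would return several rows instead of one, which is a plausible source of the reindex error, though the question does not show the index. The sketch passes `n=1` as a keyword, which recent pandas versions require.

```python
import pandas as pd

s = pd.Series(["abc_what_I_want1", "psdb_what_I_want2", ""])
parts = s.str.split("_", n=1)  # each row becomes a list of at most two pieces

# Plain [1] is a label lookup on the Series: it returns the whole list
# stored in the row labelled 1, not the second piece of every row.
row_one = parts[1]

# There is no row labelled -1, so this lookup raises KeyError.
try:
    parts[-1]
    negative_lookup_failed = False
except KeyError:
    negative_lookup_failed = True

# .str[1] indexes inside each list; rows without a second piece give NaN.
second = parts.str[1]
```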

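2 answers:

Answer 0: (score: 2)

You were actually very close. You just need .str[1] to pick the right element out of each split, plus fillna to handle the blank cells: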

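# n=1 splits on the first underscore only (pandas 2.0+ requires n as a keyword);
# .str[1] takes the part after it and yields NaN for blank cells, hence fillna('')
m = df['col1'].str.split('_', n=1).str[1].fillna('') + df['col2'].str.split('_', n=1).str[1].fillna('')
df['col3'] = m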

                col1                col2                      col3
0   abc_what_I_want1    abc_what_I_want1  what_I_want1what_I_want1
1  psdb_what_I_want2                                  what_I_want2
2   vxc_what_I_want3    vxc_what_I_want3  what_I_want3what_I_want3
3    qk_what_I_want4     qk_what_I_want4  what_I_want4what_I_want4
4                     ertsa_what_I_want5              what_I_want5
5   abc_what_I_want6    abc_what_I_want6  what_I_want6what_I_want6
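For anyone who wants to run this end to end, here is a self-contained sketch of the same split-then-`.str[1]` idea. The sample frame is a shortened version of the question's data, and `after_prefix` is a hypothetical helper name introduced only for this illustration.

```python
import pandas as pd

# Shortened sample in the spirit of the question; blank cells are "" strings.
df = pd.DataFrame({
    "col1": ["abc_what_I_want1", "psdb_what_I_want2", "vxc_what_I_want3", ""],
    "col2": ["abc_what_I_want1", "", "vxc_what_I_want3", "ertsa_what_I_want5"],
})

def after_prefix(col):
    """Hypothetical helper: text after the first underscore, '' if there is none."""
    # n=1 splits once; .str[1] is NaN when a row has no second piece.
    return col.str.split("_", n=1).str[1].fillna("")

df["col3"] = after_prefix(df["col1"]) + after_prefix(df["col2"])
```

Wrapping the expression in a small function keeps the two operands identical, so a change to the separator only has to be made in one place.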

Another way is to use apply, which lets you run the split on several columns at once:

# Split each selected column once on '_' and keep the second piece; NaN becomes ''
m = df[['col1', 'col2']].apply(lambda x: x.str.split('_', n=1).str[1]).fillna('')
df['col3'] = m['col1'] + m['col2']

                col1                col2                      col3
0   abc_what_I_want1    abc_what_I_want1  what_I_want1what_I_want1
1  psdb_what_I_want2                                  what_I_want2
2   vxc_what_I_want3    vxc_what_I_want3  what_I_want3what_I_want3
3    qk_what_I_want4     qk_what_I_want4  what_I_want4what_I_want4
4                     ertsa_what_I_want5              what_I_want5
5   abc_what_I_want6    abc_what_I_want6  what_I_want6what_I_want6
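The column-wise apply also extends to any number of source columns if the final `+` is swapped for a row-wise join. The following is a sketch under that assumption; the `cols` list and the sample data are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["abc_what_I_want1", "psdb_what_I_want2", ""],
    "col2": ["abc_what_I_want1", "", "ertsa_what_I_want5"],
})

cols = ["col1", "col2"]  # assumed list of source columns; extend as needed

# Column-wise: split each column once on "_" and keep the part after it.
parts = df[cols].apply(lambda c: c.str.split("_", n=1).str[1]).fillna("")

# Row-wise: concatenate the cleaned pieces of every listed column.
df["col3"] = parts.apply("".join, axis=1)
```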

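Answer 1: (score: 2)

You can use replace() to strip every character up to and including the first underscore, then combine each row along axis=1 with either apply() and ''.join or with sum():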

df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').apply(''.join,axis=1)

Or:

df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').sum(axis=1)

Or:

df['Col3'] = pd.Series(
    df.replace('^[^_]*_', '', regex=True).fillna('').values.tolist()
).str.join('')

                col1                col2                      Col3
0   abc_what_I_want1    abc_what_I_want1  what_I_want1what_I_want1
1  psdb_what_I_want2        what_I_want2       what_I_want2I_want2
2   vxc_what_I_want3    vxc_what_I_want3  what_I_want3what_I_want3
3    qk_what_I_want4     qk_what_I_want4  what_I_want4what_I_want4
4                NaN  ertsa_what_I_want5              what_I_want5
5   abc_what_I_want6    abc_what_I_want6  what_I_want6what_I_want6
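Two details of the regex approach are worth checking on real data, and the sketch below illustrates both on an assumed three-row sample. First, `^[^_]*_` removes everything up to the first underscore even when a value has no prefix, which is how `what_I_want2` in row 1 of the output above turns into `I_want2`. Second, calling `replace` on the whole frame would also pick up `Col3` on a second run, so this version restricts the operation to a hypothetical `source_cols` list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["abc_what_I_want1", "psdb_what_I_want2", np.nan],
    "col2": ["abc_what_I_want1", "what_I_want2", "ertsa_what_I_want5"],
})

source_cols = ["col1", "col2"]  # assumed names; keeps Col3 out of later reruns

# Drop everything up to and including the first underscore, then blank out NaN.
stripped = df[source_cols].replace(r"^[^_]*_", "", regex=True).fillna("")

# Row 1 of col2 had no prefix, so the regex eats "what_" and leaves "I_want2".
df["Col3"] = stripped.apply("".join, axis=1)
```

If unprefixed values can occur, the split-based approach in the other answer behaves the same way, so the input may need cleaning before either method is applied.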