Question

I have the following DataFrame in Python:

import pandas as pd

df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
                   'age': [22, 25, 24, 28],
                   'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3']})

Note the 'occupation' column, whose values are separated by '|'.

I want to add two new columns to the DataFrame, say new1 and new2, holding A1 and A2, B1 and B2, and so on.

I tried to achieve this with the following code:

df['new1'] = df['occupation'].str.split("|", n=2, expand=False)

The result is:

    name    age occupation  new1
0   Vinay   22  A1|A2|A3    [A1, A2, A3]
1   Kushal  25  B1|B2|B3    [B1, B2, B3]
2   Aman    24  C1|C2|C3    [C1, C2, C3]
3   Saif    28  D1|D2|D3    [D1, D2, D3]

I don't want the whole list (A1, A2, A3, etc.) in the new column. Expected output:

    name    age occupation  new1 new2
0   Vinay   22  A1|A2|A3    [A1] [A2]
1   Kushal  25  B1|B2|B3    [B1] [B2]
2   Aman    24  C1|C2|C3    [C1] [C2]
3   Saif    28  D1|D2|D3    [D1] [D2]

Please suggest a possible solution.

Answer 1

For performance, combine str.split with a list comprehension:

u = pd.DataFrame([
    x.split('|')[:2] for x in df.occupation], columns=['new1', 'new2'], index=df.index)
u

  new1 new2
0   A1   A2
1   B1   B2
2   C1   C2
3   D1   D2

pd.concat([df, u], axis=1)

     name  age occupation new1 new2
0   Vinay   22   A1|A2|A3   A1   A2
1  Kushal   25   B1|B2|B3   B1   B2
2    Aman   24   C1|C2|C3   C1   C2
3    Saif   28   D1|D2|D3   D1   D2

Why are list comprehensions fast? You can read more in "For loops with pandas - When should I care?".
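For comparison, here is a minimal sketch of a pandas-only variant of the same idea, assuming every value in occupation has at least two '|'-separated pieces. It is not the method from the answer above. The names parts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
                   'age': [22, 25, 24, 28],
                   'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3']})

# expand=True returns a DataFrame with one column per piece, labelled 0, 1, 2
parts = df['occupation'].str.split('|', expand=True)

# keep only the first two pieces; assign() returns a new DataFrame
# and leaves df itself unchanged
result = df.assign(new1=parts[0], new2=parts[1])
print(result)
```

This reads a little more directly than the list comprehension, though it may be slower on large frames, which is the trade-off the answer above is pointing at.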

Answer 2

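Here is an option that uses a regular expression with named capture groups. For more details, you can read the docstring by running pd.Series.str.extract? in the interpreter.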

# get the new columns in a separate dataframe
df_ = df['occupation'].str.extract(r'^(?P<new1>\w{2})\|(?P<new2>\w{2})')

# add brackets around each item in the new dataframe
# (applymap is deprecated since pandas 2.1, where DataFrame.map replaces it)
df_ = df_.applymap(lambda x: '[{}]'.format(x))

# add the new dataframe to your original to get the desired result
df = df.join(df_)
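As a self-contained check of the steps above, the following sketch runs the extraction on the question's data and adds the brackets with plain string concatenation instead of an element-wise function, so it does not depend on which pandas version is installed. The names extracted, bracketed and result are illustrative, and the pattern assumes each piece is exactly two word characters, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
                   'age': [22, 25, 24, 28],
                   'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3']})

# named groups become the column names of the returned DataFrame
extracted = df['occupation'].str.extract(r'^(?P<new1>\w{2})\|(?P<new2>\w{2})')

# wrap every value in brackets, one column at a time
bracketed = extracted.apply(lambda col: '[' + col + ']')

# join on the shared index to attach the new columns
result = df.join(bracketed)
print(result)
```

Rows that do not match the pattern come back as NaN from str.extract, so data with longer or shorter codes would need a looser pattern such as [^|]+ in place of \w{2}.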

Partially splitting a string column in pandas
