Regex text parser

Time: 2019-04-27 15:50:00

Tags: pandas

I have a dataframe like this:
ID  Series
1102    [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]
1500    [('forgot data pages info', 0, 22, 'NP')]
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]

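I am trying to parse the text in the column named Series into separate columns named Series1, Series2, and so on, up to the largest number of parsed texts found in any row.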

df_parsed = df['Series'].str[1:-1].str.split(', ', expand = True)

Something like this:

ID  Series  Series1 Series2 Series3
1102    [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]    taxi instructions   consistent basis    the atc taxi clearance
1500    [('forgot data pages info', 0, 22, 'NP')]   forgot data pages info      
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]    hud correctly fotr approach

1 answer:

Answer 0 (score: 0)

The format of the final result is not easy to understand, but maybe you can follow this idea to create the new column:

def process(ls):
    return ' '.join([x[0] for x in ls])

df['Series_new'] = df['Series'].apply(process)
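
One caveat, stated as an assumption about the data: if the Series column was read from a file, each cell may be a string that only looks like a list, and then x[0] would pick single characters instead of phrases. A minimal sketch that converts such cells with ast.literal_eval before applying process (the sample frame and the to_list helper are made up here for illustration):

```python
import ast

import pandas as pd

# Sample rows from the question, stored as strings (assumption: this is how
# the column looks after being loaded from a CSV file).
df = pd.DataFrame({
    "ID": [1102, 1500, 649],
    "Series": [
        "[('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), "
        "('the atc taxi clearance', 89, 111, 'NP')]",
        "[('forgot data pages info', 0, 22, 'NP')]",
        "[('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]",
    ],
})


def to_list(cell):
    """Hypothetical helper: parse a list-like string, pass real lists through."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell


def process(ls):
    # Join the first element (the phrase) of every tuple in the row.
    return ' '.join([x[0] for x in ls])


df['Series'] = df['Series'].apply(to_list)
df['Series_new'] = df['Series'].apply(process)
```

If the cells are already real lists, to_list leaves them untouched, so the conversion step is harmless either way.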

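If you want to create N new columns (where N is the length of the longest list in Series), I think you can compute N first. Then follow the idea above and fill the missing positions with NaN to create the N new columns.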
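
The N-column idea could be sketched roughly as follows. This is only one possible way to do it, assuming each cell of Series already holds a real list of tuples; the names phrases, expanded and df_parsed are chosen here for illustration:

```python
import pandas as pd

# Sample rows from the question, with Series holding real lists of tuples.
df = pd.DataFrame({
    "ID": [1102, 1500, 649],
    "Series": [
        [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'),
         ('the atc taxi clearance', 89, 111, 'NP')],
        [('forgot data pages info', 0, 22, 'NP')],
        [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')],
    ],
})

# Keep only the first element (the phrase) of every tuple, one list per row.
phrases = df['Series'].apply(lambda ls: [x[0] for x in ls])

# N is the largest number of phrases found in any row.
n = int(phrases.map(len).max())

# Building a DataFrame from lists of unequal length pads the shorter rows
# with missing values, which gives the NaN filling for free.
expanded = pd.DataFrame(phrases.tolist(), index=df.index)
expanded.columns = [f'Series{i}' for i in range(1, n + 1)]

# Attach the new columns next to the original ones.
df_parsed = df.join(expanded)
```

Rows with fewer than N phrases end up with missing values in the trailing columns, which matches the blank cells in the desired output of the question.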