Question

I have a data frame like this:

ID  Series
1102    [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]
1500    [('forgot data pages info', 0, 22, 'NP')]
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]

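I am trying to parse the text in the column named Series into separate columns named Series1, Series2, and so on, up to the largest number of parsed text items found in any row. This is what I have tried so far: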

df_parsed = df['Series'].str[1:-1].str.split(', ', expand = True)

The result should look something like this:

ID  Series  Series1 Series2 Series3
1102    [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]    taxi instructions   consistent basis    the atc taxi clearance
1500    [('forgot data pages info', 0, 22, 'NP')]   forgot data pages info      
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]    hud correctly fotr approach

Answer 1

The format of the final result is not easy to understand, but perhaps you can follow this idea to create a new column:

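def process(ls):
    # ls is a list of tuples; keep the first element (the phrase) of each tuple
    return ' '.join([x[0] for x in ls])

# Assumes each cell of df['Series'] is already a list of tuples, not a string
df['Series_new'] = df['Series'].apply(process)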

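If you want to create N new columns (where N is the maximum length of the lists in Series), I think you can compute N first. Then follow the idea above to create the N new columns, filling the missing entries with NaN.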
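As a rough sketch of that N-column idea, something like the following might work. It assumes the Series cells are either strings that look like Python lists (as they would be after reading a CSV) or real lists of tuples. The names first_items, phrases, wide and result are only illustrative, and the sample frame is rebuilt from the question's data.

```python
import ast

import pandas as pd

# Sample data copied from the question; Series is assumed to hold strings here.
df = pd.DataFrame({
    "ID": [1102, 1500, 649],
    "Series": [
        "[('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), "
        "('the atc taxi clearance', 89, 111, 'NP')]",
        "[('forgot data pages info', 0, 22, 'NP')]",
        "[('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]",
    ],
})


def first_items(cell):
    """Return the first element (the phrase) of every tuple in the cell."""
    # Parse the cell if it is a string; use it directly if it is already a list.
    tuples = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return [t[0] for t in tuples]


# One list of phrases per row.
phrases = df["Series"].apply(first_items)

# N is the length of the longest list.
n = max(len(p) for p in phrases)

# Building a frame from ragged lists pads the shorter rows with missing values.
wide = pd.DataFrame(
    phrases.tolist(),
    index=df.index,
    columns=[f"Series{i + 1}" for i in range(n)],
)

result = df.join(wide)
print(result.drop(columns="Series"))
```

Rows with fewer than N phrases end up with missing values in the trailing columns, which can be replaced with wide.fillna('') if empty strings are preferred.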
