I have the following data:
import pandas as pd

df = pd.DataFrame({'Column_A': [1, 2, 3, 4],
                   'Column_B': [["X1", "X2", "Y1"],
                                ["X3", "Y2"],
                                ["X4", "X5"],
                                ["X5", "Y3", "Y4"]]})
   Column_A      Column_B
0         1  [X1, X2, Y1]
1         2      [X3, Y2]
2         3      [X4, X5]
3         4  [X5, Y3, Y4]
I want to remove every string that starts with Y from the lists in the second column. Desired output:
   Column_A  Column_B
0         1  [X1, X2]
1         2      [X3]
2         3  [X4, X5]
3         4      [X5]
Answer 0 (score: 2)
Use a nested list comprehension with startswith to filter:
df['Column_B'] = [[y for y in x if not y.startswith('Y')] for x in df['Column_B']]
Alternatively, use apply:
df['Column_B'] = df['Column_B'].apply(lambda x: [y for y in x if not y.startswith('Y')])
Or use filter:
df['Column_B'] = [list(filter(lambda y: not y.startswith('Y'), x)) for x in df['Column_B']]
print(df)
   Column_A  Column_B
0         1  [X1, X2]
1         2      [X3]
2         3  [X4, X5]
3         4      [X5]
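For reference, here is a self-contained sketch that runs the three variants on the sample data and checks that they agree. The helper name drop_prefixed and its isinstance guard for cells that are not lists (for example NaN) are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Column_A': [1, 2, 3, 4],
                   'Column_B': [["X1", "X2", "Y1"],
                                ["X3", "Y2"],
                                ["X4", "X5"],
                                ["X5", "Y3", "Y4"]]})


def drop_prefixed(values, prefix='Y'):
    """Hypothetical helper: drop strings that start with `prefix`.

    Cells that are not lists (for example NaN) are returned unchanged.
    """
    if not isinstance(values, list):
        return values
    return [v for v in values if not v.startswith(prefix)]


# The three variants from the answer, computed without modifying df yet.
by_comprehension = [[y for y in x if not y.startswith('Y')] for x in df['Column_B']]
by_apply = df['Column_B'].apply(drop_prefixed).tolist()
by_filter = [list(filter(lambda y: not y.startswith('Y'), x)) for x in df['Column_B']]

# All three should produce the same lists.
assert by_comprehension == by_apply == by_filter

df['Column_B'] = by_comprehension
print(df)
```

The guard only matters if the column can contain missing values; with the sample data every cell is a list, so the plain list comprehension is enough.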
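Performance:
The timings depend on the number of rows, the number of values in each list, and the number of matched values.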
#[40000 rows x 2 columns]
df = pd.concat([df] * 10000, ignore_index=True)
#print (df)
In [142]: %timeit df['Column_B'] = [[y for y in x if not y.startswith('Y')] for x in df['Column_B']]
23.7 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [143]: %timeit df['Column_B'] = [list(filter(lambda y: not y.startswith('Y'), x)) for x in df['Column_B']]
36.5 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [144]: %timeit df['Column_B'] = df['Column_B'].apply(lambda x: [y for y in x if not y.startswith('Y')])
30.4 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
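The %timeit lines above only work in IPython. As a rough sketch, the same comparison can be run from a plain Python script with the standard timeit module. The function names with_comprehension, with_filter and with_apply, and the repeat counts, are assumptions chosen for illustration. Unlike the session above, this version does not assign the result back to the column, so every run filters the same unfiltered input. The numbers will differ from machine to machine.

```python
import timeit

import pandas as pd

base = pd.DataFrame({'Column_A': [1, 2, 3, 4],
                     'Column_B': [["X1", "X2", "Y1"],
                                  ["X3", "Y2"],
                                  ["X4", "X5"],
                                  ["X5", "Y3", "Y4"]]})

# Same enlargement as in the answer: 4 rows * 10000 = 40000 rows.
big = pd.concat([base] * 10000, ignore_index=True)


def with_comprehension(col):
    return [[y for y in x if not y.startswith('Y')] for x in col]


def with_filter(col):
    return [list(filter(lambda y: not y.startswith('Y'), x)) for x in col]


def with_apply(col):
    return col.apply(lambda x: [y for y in x if not y.startswith('Y')])


calls = 3  # calls per measurement (illustrative choice)
timings = {}
for func in (with_comprehension, with_filter, with_apply):
    # Take the best of 3 measurements and convert to seconds per call.
    runs = timeit.repeat(lambda: func(big['Column_B']), number=calls, repeat=3)
    timings[func.__name__] = min(runs) / calls
    print(f"{func.__name__}: {timings[func.__name__] * 1000:.1f} ms per call")
```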