I have a DataFrame with a column that holds arrays of different sizes, as shown below:
column
["a_id","b","c","d"]
["d_ID","e","f"]
["h","i","j","k","l"]
["id_m","n","o","p"]
["ID_q","r","s"]
I want to remove the first item from each row's array if that first item contains "ID" or "id". The expected output would therefore look like this:
column
["b","c","d"]
["e","f"]
["h","i","j","k","l"]
["n","o","p"]
["r","s"]
How can we perform this check on a DataFrame column whose elements are arrays?
Answer 0: (score: 4)
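Edit: I seem to have misunderstood your question at first. Note that the code below drops only the first element of each list, and only when it contains 'id' in any letter case; it does not drop every element that contains 'id'.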
Option 1
I think the most straightforward solution is to use apply:
df
col
0 [a_id, b, c, d]
1 [d_ID, e, f]
2 [h, i, j, k, l]
3 [id_m, n, o, p]
4 [ID_q, r, s]
df.col = df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))
df
col
0 [b, c, d]
1 [e, f]
2 [h, i, j, k, l]
3 [n, o, p]
4 [r, s]
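For anyone who wants to run Option 1 outside an interactive session, here is a minimal self-contained sketch. The DataFrame is rebuilt from the question's data, and the column name `col` is assumed to match the one used in this answer.

```python
import pandas as pd

# Rebuild the example frame from the question (column name `col` as in this answer).
df = pd.DataFrame({
    "col": [
        ["a_id", "b", "c", "d"],
        ["d_ID", "e", "f"],
        ["h", "i", "j", "k", "l"],
        ["id_m", "n", "o", "p"],
        ["ID_q", "r", "s"],
    ]
})

# Drop the first element only when it contains "id" in any letter case;
# lower() makes the substring test case-insensitive.
df["col"] = df["col"].apply(lambda y: y[1:] if "id" in y[0].lower() else y)

print(df["col"].tolist())
```

Slicing with `y[1:]` returns a new list, so the original lists are not modified in place.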
Option 2
Alternatively, use a list comprehension:
df.col = [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]
df
col
0 [b, c, d]
1 [e, f]
2 [h, i, j, k, l]
3 [n, o, p]
4 [r, s]
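Both options index `y[0]` directly, so they would raise on an empty list or a non-string first element. The question's data contains neither, so the following is only a sketch for messier input. The helper name `drop_leading_id` and the sample rows are hypothetical, not from the question.

```python
import pandas as pd

def drop_leading_id(items):
    """Return `items` without its first element if that element contains 'id' (any case)."""
    # Hypothetical helper: empty lists and non-string first elements pass through unchanged.
    if items and isinstance(items[0], str) and "id" in items[0].lower():
        return items[1:]
    return items

# Made-up rows covering the edge cases: an empty list and a non-string first element.
df = pd.DataFrame({"col": [["a_id", "b"], [], ["h", "i"], [1, "id_x"]]})
df["col"] = [drop_leading_id(y) for y in df["col"]]

print(df["col"].tolist())
```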
Timings
df = pd.concat([df] * 100000)
%%timeit
m = df['col'].str[0].str.contains('ID', case=False)
df['col'].mask(m, df['col'].str[1:])
1 loop, best of 3: 917 ms per loop
%timeit [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]
1 loop, best of 3: 272 ms per loop
%timeit df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))
1 loop, best of 3: 309 ms per loop
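The `%timeit` magics above only work in IPython or Jupyter. A rough equivalent with the standard `timeit` module might look like the sketch below. The function names and the smaller replication factor are my own choices, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "col": [
        ["a_id", "b", "c", "d"],
        ["d_ID", "e", "f"],
        ["h", "i", "j", "k", "l"],
        ["id_m", "n", "o", "p"],
        ["ID_q", "r", "s"],
    ]
})
# Arbitrary replication factor, smaller than the 100000 used above, to keep the run short.
big = pd.concat([base] * 1000, ignore_index=True)

def with_mask():
    m = big["col"].str[0].str.contains("ID", case=False)
    return big["col"].mask(m, big["col"].str[1:])

def with_comprehension():
    return [y[1:] if "id" in y[0].lower() else y for y in big["col"]]

def with_apply():
    return big["col"].apply(lambda y: y[1:] if "id" in y[0].lower() else y)

for fn in (with_mask, with_comprehension, with_apply):
    # Best of 3 single runs, mirroring the "best of 3" reporting above.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms")
```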
Answer 1: (score: 3)
Use str[0] to select the first value of each list, then check it for ID with contains:
m = df['column'].str[0].str.contains('ID', case=False)
print (m)
0 True
1 True
2 False
3 True
4 True
Name: column, dtype: bool
Then use str[1:] together with mask to drop the first element where the condition holds:
df['column'] = df['column'].mask(m, df['column'].str[1:])
print (df)
column
0 [b, c, d]
1 [e, f]
2 [h, i, j, k, l]
3 [n, o, p]
4 [r, s]
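The two steps can be combined into one runnable sketch. It uses the question's data plus one extra empty-list row, which is my addition and not from the question. For that row `str[0]` yields a missing value, so `na=False` is passed to `contains` to keep the mask strictly boolean.

```python
import pandas as pd

df = pd.DataFrame({
    "column": [
        ["a_id", "b", "c", "d"],
        ["d_ID", "e", "f"],
        ["h", "i", "j", "k", "l"],
        ["id_m", "n", "o", "p"],
        ["ID_q", "r", "s"],
        [],  # extra row (assumption, not in the question) to show the na=False guard
    ]
})

# First element of each list; out-of-range positions become missing values.
first = df["column"].str[0]

# Case-insensitive check; na=False treats missing first elements as "no match".
m = first.str.contains("ID", case=False, na=False)

# Where the mask is True, replace the list with the same list minus its first element.
df["column"] = df["column"].mask(m, df["column"].str[1:])

print(df["column"].tolist())
```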