Python pandas dataframe: in an array column, if the first item contains a specific string, remove it from the array

Asked: 2017-11-08 06:34:27

Tags: python arrays python-3.x pandas dataframe

I have a dataframe with a column like the one below, which holds arrays of different sizes:

column
["a_id","b","c","d"]
["d_ID","e","f"]
["h","i","j","k","l"]
["id_m","n","o","p"]
["ID_q","r","s"]

I want to remove the first item from the array in each row if that first item contains "ID" or "id". So the expected output would look like this:

column
["b","c","d"]
["e","f"]
["h","i","j","k","l"]
["n","o","p"]
["r","s"]

How can we check this in a dataframe column whose cells contain arrays?
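For anyone who wants to reproduce the problem, here is a minimal sketch of the sample frame. It assumes each cell holds a real Python list of strings (not the string representation of a list), which the post does not state explicitly; the variable name `heads` is only illustrative.

```python
import pandas as pd

# Rebuild the sample frame; every cell is a Python list of strings (an assumption).
df = pd.DataFrame({
    "column": [
        ["a_id", "b", "c", "d"],
        ["d_ID", "e", "f"],
        ["h", "i", "j", "k", "l"],
        ["id_m", "n", "o", "p"],
        ["ID_q", "r", "s"],
    ]
})

# The first item of each list is the only thing that has to be inspected.
heads = [row[0] for row in df["column"]]
print(heads)
```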

2 answers:

Answer 0: (score: 4)

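Edit: I seem to have misunderstood your question at first. Note that the code below only inspects the first element of each list: it drops that element when it contains 'id' (in any letter case) and leaves the rest of the list untouched.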

Option 1
I think the most straightforward solution is to use apply:

df

               col
0  [a_id, b, c, d]
1     [d_ID, e, f]
2  [h, i, j, k, l]
3  [id_m, n, o, p]
4     [ID_q, r, s]


df.col = df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))

df
               col
0        [b, c, d]
1           [e, f]
2  [h, i, j, k, l]
3        [n, o, p]
4           [r, s]
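The one-liner above assumes every list is non-empty and starts with a string. If that might not hold for your data, a guarded variant could look like the sketch below. The helper name `drop_id_head` and the extra sample rows are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def drop_id_head(items):
    """Return `items` without its first element when that element contains 'id'."""
    # Skip empty lists and non-string heads instead of raising on them.
    if items and isinstance(items[0], str) and "id" in items[0].lower():
        return items[1:]
    return items

# Hypothetical rows: a match, a non-match, an empty list, a non-string head.
df = pd.DataFrame({"col": [["a_id", "b"], ["h", "i"], [], [3, "id_x"]]})
df["col"] = df["col"].apply(drop_id_head)
print(df["col"].tolist())
```

Assigning through `df["col"]` rather than `df.col` is a stylistic choice here; both work for an existing column.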

Option 2
Alternatively, use a list comprehension:

df.col = [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]

df

               col
0        [b, c, d]
1           [e, f]
2  [h, i, j, k, l]
3        [n, o, p]
4           [r, s]
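One caveat with a plain substring test is that `'id' in y[0].lower()` also matches values such as "video". If the ids are always set off by underscores, as they are in the sample data (that is an assumption about your real data), a stricter check might be sketched like this. The pattern `ID_TOKEN` is an illustrative name, not something from the question.

```python
import re
import pandas as pd

# Match "id" only at the start/end of the string or next to an underscore.
ID_TOKEN = re.compile(r"(?:^|_)id(?:_|$)", flags=re.IGNORECASE)

# Hypothetical rows, including one ("video") that a substring test would hit.
df = pd.DataFrame({"col": [["a_id", "b"], ["video", "x"], ["ID_q", "r", "s"]]})
df["col"] = [y[1:] if ID_TOKEN.search(y[0]) else y for y in df["col"]]
print(df["col"].tolist())
```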

Timings

df = pd.concat([df] * 100000)

%%timeit
m = df['col'].str[0].str.contains('ID', case=False)
df['col'].mask(m, df['col'].str[1:])
1 loop, best of 3: 917 ms per loop

%timeit [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]
1 loop, best of 3: 272 ms per loop

%timeit df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))
1 loop, best of 3: 309 ms per loop
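The `%timeit` magics only work in IPython or Jupyter. As a rough sketch, the same comparison can be run from a plain script with the standard `timeit` module. The frame size (1000 repeats) and the function names are my own choices, and the numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({"col": [
    ["a_id", "b", "c", "d"],
    ["d_ID", "e", "f"],
    ["h", "i", "j", "k", "l"],
    ["id_m", "n", "o", "p"],
    ["ID_q", "r", "s"],
]})
# Smaller than the 100000 repeats used above so the script finishes quickly.
big = pd.concat([base] * 1000, ignore_index=True)

def with_mask():
    m = big["col"].str[0].str.contains("ID", case=False)
    return big["col"].mask(m, big["col"].str[1:])

def with_comprehension():
    return [y[1:] if "id" in y[0].lower() else y for y in big["col"]]

def with_apply():
    return big["col"].apply(lambda y: y[1:] if "id" in y[0].lower() else y)

for fn in (with_mask, with_comprehension, with_apply):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```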

Answer 1: (score: 3)

Use str[0] to select the first value of each list, then check it for ID with contains:

m = df['column'].str[0].str.contains('ID', case=False)
print (m)
0     True
1     True
2    False
3     True
4     True
Name: column, dtype: bool

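Then use str[1:] to build the lists without their first value, and pass that to mask so it only replaces the rows where the condition is True: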
df['column'] = df['column'].mask(m, df['column'].str[1:])
print (df)
            column
0        [b, c, d]
1           [e, f]
2  [h, i, j, k, l]
3        [n, o, p]
4           [r, s]
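Put together as one runnable script, the approach might look like the sketch below. The `regex=False` and `na=False` arguments are additions of mine: the first treats the pattern as a plain substring rather than a regular expression, and the second is meant to make rows with a missing first value count as non-matches.

```python
import pandas as pd

df = pd.DataFrame({"column": [
    ["a_id", "b", "c", "d"],
    ["d_ID", "e", "f"],
    ["h", "i", "j", "k", "l"],
    ["id_m", "n", "o", "p"],
    ["ID_q", "r", "s"],
]})

# .str[0] picks the first element of each list (the .str accessor indexes lists too).
first = df["column"].str[0]

# Case-insensitive plain-substring test; missing values are treated as False.
m = first.str.contains("ID", case=False, regex=False, na=False)

# Where m is True, swap in the list without its first element.
df["column"] = df["column"].mask(m, df["column"].str[1:])
print(df["column"].tolist())
```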