Remove elements from a pandas column if they are not in a list

Time: 2019-11-09 21:08:57

Tags: python pandas

Suppose I have a list like this:

certificates = ['ISO9001', 'ISO203', 'CE2234']

and a DataFrame like this:

company_certificates
[ISO303, ISO9001]
[GlobalGAP12, ISO203]
[EuroGAP]

I want to remove from company_certificates every element that is not in the certificates list. I know I can do something like this:

df['company_certificates'] = df['company_certificates'].apply(lambda x: [i for i in x if i in certificates])

The final output would be:

company_certificates
[ISO9001]
[ISO203]
[]

But since my DataFrame is large, I need a more efficient way to do this. Any ideas?

2 Answers:

Answer 0: (score: 1)

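Check whether each item in a row's list appears in the certificates list, and build a mask that is True for the rows in which none of the items appears. Then replace the value in those rows with an empty list. Rows that contain at least one listed certificate are left unchanged, as the output below shows.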

>>> mask = ~df['company_certificates'].explode().isin(certificates) \
              .groupby(level=0).any()
>>> mask
index
0    False
1    False
2     True

>>> # Note: []*mask.sum() evaluates to [], so the right-hand side is simply [[]]
>>> df.loc[mask,'company_certificates'] = [[]*mask.sum()]
>>> df
    company_certificates
0      [ISO303, ISO9001]
1  [GlobalGAP12, ISO203]
2                     []
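
The mask above only empties rows that have no match at all. As a minimal sketch of how the same explode/isin idea might be extended to drop individual items and reproduce the output requested in the question, one could filter the exploded values and regroup them by the original row label. The column name `filtered` is an assumption made for illustration, and this has not been timed against the other approaches.

```python
import pandas as pd

certificates = ['ISO9001', 'ISO203', 'CE2234']
df = pd.DataFrame({
    'company_certificates': [
        ['ISO303', 'ISO9001'],
        ['GlobalGAP12', 'ISO203'],
        ['EuroGAP'],
    ]
})

# One row per certificate; the original row label is repeated in the index.
exploded = df['company_certificates'].explode()

# Keep only the certificates that appear in the reference list.
kept = exploded[exploded.isin(certificates)]

# Rebuild one list per original row. Rows without any match are missing
# after the groupby, so reindex brings them back as NaN.
regrouped = kept.groupby(level=0).agg(list).reindex(df.index)

# Replace the NaN placeholders with empty lists.
# 'filtered' is a hypothetical column name used for this sketch.
df['filtered'] = [v if isinstance(v, list) else [] for v in regrouped]

print(df['filtered'].tolist())
```

Rows that lose all their items need the reindex step, because a groupby only returns labels that still have at least one value.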

Answer 1: (score: 1)

import pandas as pd

data = {'company_certificates': [['ISO303', 'ISO9001'], ['GlobalGAP12', 'ISO203'], ['EuroGAP']]}
data['company_certificates'] *= 1000000

df = pd.DataFrame(data)
certificates = ['ISO9001', 'ISO203', 'CE2234']

# 3.1 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
c2 = df['company_certificates'].tolist()
c1set = frozenset(certificates)
df['match'] = [[n for n in lst if n in c1set] for lst in c2]

# 4.32 s ± 578 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['match'] = df['company_certificates'].apply(lambda x: [i for i in x if i in certificates])

# 7.23 s ± 616 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['match'] = df['company_certificates'].apply(lambda x: list(set(x).intersection(certificates)))     

# 9.43 s ± 913 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['match'] = df['company_certificates'].apply(lambda x: list(filter(lambda y: y in x, certificates)))

# 32 s ± 2.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
mask = ~df['company_certificates'].explode().isin(certificates).reset_index() \
               .groupby('index').any()['company_certificates']
df.loc[mask,'company_certificates'] = [[]*mask.sum()]
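
For reference, the fastest variant in the timings above (a plain list comprehension with a frozenset lookup) could be wrapped in a small helper and checked against the three-row example from the question. This is only a sketch: `keep_allowed` and its parameter names are hypothetical and do not come from the original answer.

```python
import pandas as pd


def keep_allowed(series, allowed):
    """Return one list per row, keeping only the items found in `allowed`.

    `keep_allowed` is a hypothetical helper name used for illustration.
    """
    # A frozenset gives constant-time membership tests on average,
    # instead of scanning the list of allowed values for every item.
    allowed_set = frozenset(allowed)
    return [[item for item in items if item in allowed_set]
            for items in series.tolist()]


certificates = ['ISO9001', 'ISO203', 'CE2234']
df = pd.DataFrame({
    'company_certificates': [
        ['ISO303', 'ISO9001'],
        ['GlobalGAP12', 'ISO203'],
        ['EuroGAP'],
    ]
})

df['match'] = keep_allowed(df['company_certificates'], certificates)
print(df['match'].tolist())
```

Unlike the `set(x).intersection(...)` variant, the comprehension keeps the original order of the items in each row and preserves duplicates.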