Suppose I have a list like this:
certificates = ['ISO9001', 'ISO203', 'CE2234']
and a DataFrame like this:
company_certificates
[ISO303, ISO9001]
[GlobalGAP12, ISO203]
[EuroGAP]
I want to remove every element of company_certificates that is not contained in the certificates list. I know I can do something like this:
df['company_certificates'] = df['company_certificates'].apply(lambda x: [i for i in x if i in certificates])
The final output would be:
company_certificates
[ISO9001]
[ISO203]
[]
But since my DataFrame is large, I need a more efficient way to do this. Any ideas?
Answer 0 (score: 1)
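Check whether each item in each row's list appears in the certificates list, and build a mask that is True for the rows where none of the items does. Then replace the value in those rows with an empty list. Rows with at least one match are left unchanged.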
>>> mask = ~df['company_certificates'].explode().isin(certificates) \
.groupby(level=0).any()
>>> mask
index
0 False
1 False
2 True
>>> df.loc[mask,'company_certificates'] = [[]*mask.sum()]
>>> df
company_certificates
0 [ISO303, ISO9001]
1 [GlobalGAP12, ISO203]
2 []
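The mask above only empties rows that have no match at all, so rows 0 and 1 keep their non-matching items. If the goal is the per-element filtering described in the question, one possible explode-based sketch is shown below. It assumes the DataFrame index is unique, and the column name filtered is only illustrative. It has not been timed here, so it is not necessarily faster than the list comprehension.

```python
import pandas as pd

df = pd.DataFrame({'company_certificates': [['ISO303', 'ISO9001'],
                                            ['GlobalGAP12', 'ISO203'],
                                            ['EuroGAP']]})
certificates = ['ISO9001', 'ISO203', 'CE2234']

# One row per list item; the original row label is repeated in the index.
exploded = df['company_certificates'].explode()

# Keep only the items that appear in certificates.
kept = exploded[exploded.isin(certificates)]

# Collect the surviving items back into one list per original row.
filtered = kept.groupby(level=0).agg(list)

# Rows with no surviving item are missing from the groupby result:
# bring them back and give them an empty list.
filtered = filtered.reindex(df.index)
filtered = filtered.apply(lambda v: v if isinstance(v, list) else [])

# Hypothetical output column name, chosen for illustration.
df['filtered'] = filtered
print(df)
```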
Answer 1 (score: 1)
import pandas as pd
data = {'company_certificates': [['ISO303', 'ISO9001'], ['GlobalGAP12', 'ISO203'], ['EuroGAP']]}
data['company_certificates'] *= 1000000
df = pd.DataFrame(data)
certificates = ['ISO9001', 'ISO203', 'CE2234']
# 3.1 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
c2 = df['company_certificates'].tolist()
c1set = frozenset(certificates)
df['match'] = [[n for n in lst if n in c1set] for lst in c2]
# 4.32 s ± 578 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['match'] = df['company_certificates'].apply(lambda x: [i for i in x if i in certificates])
# 7.23 s ± 616 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['match'] = df['company_certificates'].apply(lambda x: list(set(x).intersection(certificates)))
# 9.43 s ± 913 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['match'] = df['company_certificates'].apply(lambda x: list(filter(lambda y: y in x, certificates)))
# 32 s ± 2.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
mask = ~df['company_certificates'].explode().isin(certificates).reset_index() \
.groupby('index').any()['company_certificates']
df.loc[mask,'company_certificates'] = [[]*mask.sum()]
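The timings above rely on the IPython %%timeit magic. For anyone who wants to repeat the comparison in a plain script, here is a minimal sketch using the standard-library timeit module on a much smaller frame. The function names with_set and with_apply are made up for this example, and the absolute numbers will differ from those reported above.

```python
import timeit

import pandas as pd

certificates = ['ISO9001', 'ISO203', 'CE2234']
rows = [['ISO303', 'ISO9001'], ['GlobalGAP12', 'ISO203'], ['EuroGAP']] * 10_000
df = pd.DataFrame({'company_certificates': rows})


def with_set(frame, allowed):
    """List comprehension over plain Python lists with a frozenset lookup."""
    allowed_set = frozenset(allowed)
    lists = frame['company_certificates'].tolist()
    return [[c for c in lst if c in allowed_set] for lst in lists]


def with_apply(frame, allowed):
    """The apply-based version from the question (list membership test)."""
    return frame['company_certificates'].apply(
        lambda x: [i for i in x if i in allowed]).tolist()


# Both variants should produce the same filtered lists.
assert with_set(df, certificates) == with_apply(df, certificates)

runs = 5
for fn in (with_set, with_apply):
    seconds = timeit.timeit(lambda: fn(df, certificates), number=runs)
    print(f"{fn.__name__}: {seconds / runs:.4f} s per run")
```

Both helpers return a plain list of lists, which can be assigned to a column as in the snippets above, for example df['match'] = with_set(df, certificates).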