Question

I have a list like this:

lst = ['a','b','c']

and a dataframe like this:

id  col1
1   ['a','c']
2   ['b']
3   ['d', 'a']

I want to create a new column in the dataframe that holds the length of the intersection between lst and each list in col1:

id  col1         intersect
1   ['a','c']    2
2   ['b']        1
3   ['d', 'a']   1

My current code is:

df['intersection'] = np.nan
for i, r in df.iterrows():
    # NaN != NaN, so this check skips rows where col1 is NaN
    if r['col1'] == r['col1']:
        # .at sets a single cell in place
        df.at[i, 'intersection'] = len(set(r['col1']).intersection(set(lst)))

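The problem is that this code is very slow on a dataset of 200K rows when intersecting with a list of 200 elements. Is there a more efficient way to do this?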

Thanks!

Answer 1

Have you tried this?

lstset = set(lst)
df['intersection'] = df['col1'].apply(lambda x: len(set(x).intersection(lstset)))
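The question mentions NaN values in col1, and set(x) raises a TypeError on a float NaN. Here is a minimal sketch of a NaN-tolerant variant, assuming missing cells are float NaN and should stay NaN in the result. The helper name intersect_len and the sample row with NaN are made up for illustration:

```python
import numpy as np
import pandas as pd

lst = ['a', 'b', 'c']
# Sample data from the question, plus one NaN row (an assumption for illustration)
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'col1': [['a', 'c'], ['b'], ['d', 'a'], np.nan]})

lstset = set(lst)  # build the lookup set once, not once per row

def intersect_len(cell):
    # Hypothetical helper: list-like cells give the intersection size,
    # anything else (e.g. NaN) gives NaN.
    if isinstance(cell, (list, tuple, set)):
        return len(lstset.intersection(cell))
    return np.nan

df['intersection'] = df['col1'].apply(intersect_len)
```

Because of the NaN row, the resulting column is float rather than integer.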

Another possibility is:

df['intersection'] = df['col1'].apply(lambda x: len([1 for item in x if item in lst]))
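A further option that avoids a Python-level function call per row is to flatten the lists with Series.explode and test membership with Series.isin. This is only a sketch on the sample data from the question (the NaN row is an added assumption), and whether it is actually faster on 200K rows is worth timing rather than assuming:

```python
import numpy as np
import pandas as pd

lst = ['a', 'b', 'c']
# Sample data from the question, plus one NaN row (an assumption for illustration)
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'col1': [['a', 'c'], ['b'], ['d', 'a'], np.nan]})

# One row per list element; the original index label is repeated for each element.
exploded = df['col1'].explode()
# Membership test for every element, then count the hits per original row.
counts = exploded.isin(lst).groupby(level=0).sum()
df['intersection'] = counts
```

Like the list-comprehension version above, this counts repeated items in col1 separately, so it matches the set intersection only when each list has no duplicates. Rows where col1 is NaN come out as 0 rather than NaN.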
