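Computing the length of the intersection between a list and a pandas column of lists

Time: 2020-02-26 19:19:35

Tags: python pandas numpy intersection set-intersection

I have a list of unique random integers and a dataframe with a column of lists, as shown below: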

>>> panel
    [1, 10, 9, 5, 6]

>>> df
       col1 
    0  [1, 5]
    1  [2, 3, 4]
    2  [9, 10, 6]

The output I want is the size of the overlap between panel and each individual list in the dataframe:

>>> result
       col1        res
    0  [1, 5]      2
    1  [2, 3, 4]   0
    2  [9, 10, 6]  3

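Currently I am using the apply function, but I would like to know whether there is a faster way, because I need to create many panels and run this task in a loop for each one.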

# My version right now
def cntOverlap(panel, series):
    # Typically the lists inside df will be much shorter than panel, 
    # so I think the fastest way would be converting the panel into a set 
    # and loop through the lists within the dataframe

    return sum(1 for x in series if x in panel)
    #return len(np.setxor1d(list(panel), series))
    #return len(panel.difference(series))


for i, panel in enumerate(list_of_panels):
    panel = set(panel)
    df[f"panel_{i}"] = df["col1"].apply(lambda x: cntOverlap(panel, x))

2 answers:

Answer 0 (score: 2)

You can use explode (available in pandas 0.25+) and isin:

df['col1'].explode().isin(panel).sum(level=0)

Output:

0    2.0
1    0.0
2    3.0
Name: col1, dtype: float64
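
Series.sum(level=...) was deprecated in pandas 1.3 and is no longer available in pandas 2.0 and later. A minimal sketch of the same idea for newer versions, assuming the sample data from the question, replaces it with groupby(level=0).sum(); the final astype(int) is only there so the counts are integers regardless of pandas version:

```python
import pandas as pd

panel = [1, 10, 9, 5, 6]
df = pd.DataFrame({"col1": [[1, 5], [2, 3, 4], [9, 10, 6]]})

# One row per list element; the original row label is repeated in the index.
exploded = df["col1"].explode()

# Flag the elements that occur in panel, then count the flags per original row.
df["res"] = exploded.isin(panel).groupby(level=0).sum().astype(int)
```

Because the exploded series keeps the original row labels, the grouped result aligns with df on assignment.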

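Answer 1 (score: 2)

Since the number of elements varies from row to row, we have to iterate (explicitly or implicitly, i.e. under the hood) while staying in Python. We can, however, arrange things so that the work done per iteration is as small as possible. Following that philosophy, here is one approach based on array assignment and some masking: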

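import numpy as np

# l is the input list of unique random integers (the panel)
s = df.col1
# Largest value used as an index; it must cover both l and the lists in df.
# If not known, use: max(max(l), max(max(v) for v in s))
max_num = 10
map_ar = np.zeros(max_num+1, dtype=bool)
map_ar[l] = 1
df['res'] = [map_ar[v].sum() for v in s]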

Or use a 2D array assignment to further reduce the per-iteration computation:

map_ar = np.zeros((len(df),max_num+1), dtype=bool)
map_ar[:,l] = 1
for i,v in enumerate(s):
    map_ar[i,v] = 0
df['res'] = len(l)-map_ar.sum(1)
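
To make the two lookup-table variants easier to check, here is a self-contained sketch that runs both on the sample data from the question and compares them with a plain set intersection. The names rows, lookup and grid are illustrative, not from the answer. It assumes that all values are non-negative integers, that the panel has no duplicates, and that no list is empty:

```python
import numpy as np

panel = [1, 10, 9, 5, 6]                 # unique non-negative integers
rows = [[1, 5], [2, 3, 4], [9, 10, 6]]   # stand-in for df["col1"]

# The table must be indexable by every value in panel and in rows.
max_num = max(max(panel), max(max(r) for r in rows))

# 1D variant: lookup[k] is True exactly when k is in panel.
lookup = np.zeros(max_num + 1, dtype=bool)
lookup[panel] = True
counts = [int(lookup[r].sum()) for r in rows]

# 2D variant: start every row with all panel values marked, clear the values
# present in that row, and count how many panel marks were removed.
grid = np.zeros((len(rows), max_num + 1), dtype=bool)
grid[:, panel] = True
for i, r in enumerate(rows):
    grid[i, r] = False
counts_2d = (len(panel) - grid.sum(axis=1)).tolist()

# Reference result using plain Python sets.
expected = [len(set(panel) & set(r)) for r in rows]
```

The 2D variant trades memory for less work inside the loop: it allocates len(rows) * (max_num + 1) booleans, so it only makes sense when the value range is small.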