How to efficiently remove elements from the lists in a pandas DataFrame column

Date: 2020-08-02 01:52:14

Tags: python pandas

My pandas DataFrame has the following structure:

   Col1   |           Col2      |     Col3
   -------+---------------------+--------------
0   6     |    [a,b,c,d,e,f]    |     ....
1   4     |    [a,g,h,i]        |     ....
2   5     |    [a,b,j,k,l]      |     ....

I have a list of elements, [a,b,h], that must be removed from every list in Col2.

In the end I need to transform the DataFrame into

   Col1   |           Col2  |     Col3
   -------+-----------------+--------------
0   4     |    [c,d,e,f]    |     ....
1   2     |    [g,i]        |     ....
2   3     |    [j,k,l]      |     ....

Col1 is the number of elements in Col2.

I tried

import numpy as np

def modify_data(dataset):
    ds = dataset.copy()
    Col2 = dataset['Col2']
    remove_list = ['a', 'b', 'h']  # elements to drop from every list
    removed_col2 = []
    counts = []
    for row in Col2:
        # set difference drops the unwanted elements (order is not preserved)
        cleaned = np.array(list(set(row) - set(remove_list)))
        removed_col2.append(cleaned)
        counts.append(len(cleaned))

    ds.loc[:, 'Col1'] = counts
    ds.loc[:, 'Col2'] = removed_col2
    return ds

But the performance is very poor. For example, for a dataset with 200,000 rows:

CPU times: user 11min 26s, sys: 24.2 s, total: 11min 50s
Wall time: 11min 48s

2 answers:

Answer 0: (score: 3)

I would try this:

df.Col2 = (df.Col2.map(set)-set(['a','b','h'])).map(list)
df.Col1 = df.Col2.str.len()
df
Out[111]: 
           Col2  Col1
0  [f, e, c, d]     4
1        [g, i]     2
2     [j, k, l]     3
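
Note that the output above comes back in arbitrary order ([f, e, c, d]), because sets are unordered; converting to a set also drops any duplicate values within a list. The following is a minimal sketch of the same set-difference idea written with an explicit `map`, assuming the elements are sortable strings so that `sorted` can make the order deterministic. The sample data mirrors the question, and the name `remove` is only illustrative.

```python
import pandas as pd

# Sample data mirroring the structure shown in the question.
df = pd.DataFrame(
    {
        "Col1": [6, 4, 5],
        "Col2": [
            ["a", "b", "c", "d", "e", "f"],
            ["a", "g", "h", "i"],
            ["a", "b", "j", "k", "l"],
        ],
    }
)

remove = {"a", "b", "h"}  # illustrative name for the elements to drop

# Set difference per row; sorted() returns a list in a deterministic order.
df["Col2"] = df["Col2"].map(lambda items: sorted(set(items) - remove))

# .str.len() also works on list-valued object columns.
df["Col1"] = df["Col2"].str.len()
```

If the original order or duplicate values matter, a set difference is not a good fit and a filtering approach is safer.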

Answer 1: (score: 1)

Another solution, using a list comprehension:

df = pd.DataFrame(
    {
        "col1": [6, 4, 5],
        "col2": [
            ["a", "b", "c", "d", "e", "f"],
            ["a", "g", "h", "i"],
            ["a", "b", "j", "k", "l"],
        ],
    }
)

df['col2'] = [[value for value in entry
               if value not in ('a','b','h')] 
              for entry in df.col2
             ]
df['col1'] = df.col2.str.len()


   col1     col2
0   4   [c, d, e, f]
1   2   [g, i]
2   3   [j, k, l]
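
For a longer removal list, membership tests against a tuple become linear in its length, so it may help to build a set once and filter against that. The sketch below wraps the list comprehension in a helper; the function name `remove_elements` and its parameters are hypothetical, not part of pandas. Unlike a set difference, it keeps the original order and any duplicates.

```python
import pandas as pd


def remove_elements(df, to_remove, list_col="col2", count_col="col1"):
    """Return a copy of df with to_remove filtered out of every list in list_col."""
    remove_set = set(to_remove)  # built once; O(1) average membership test
    out = df.copy()
    filtered = [
        [value for value in entry if value not in remove_set]
        for entry in out[list_col]
    ]
    # Wrap in a Series aligned to the existing index before assigning.
    out[list_col] = pd.Series(filtered, index=out.index)
    out[count_col] = out[list_col].map(len)
    return out


df = pd.DataFrame(
    {
        "col1": [6, 4, 5],
        "col2": [
            ["a", "b", "c", "d", "e", "f"],
            ["a", "g", "h", "i"],
            ["a", "b", "j", "k", "l"],
        ],
    }
)

result = remove_elements(df, ["a", "b", "h"])
```

Here `result` holds the filtered lists and updated counts, while `df` keeps its original column assignments because the helper works on a copy.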