Fast, efficient way to remove rows from a large Pandas DataFrame

Date: 2015-05-15 00:35:32

Tags: python pandas

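I want to remove rows from a large Pandas DataFrame that contains analytics data based on the actions/events users perform on a website. Every user's flow of actions begins with a start event and ends with an end event. I want to find all users who have completed a specific event (e.g. signed up, index 12 in the example data frame) and delete all events after that event, up to and including the end event. So in this example, the rows at indices 13 through 20 (viewed blog post, page view, visited site, ad campaign hit, viewed blog post, visited site, page view and end) must be removed from the data frame.

In [26]: data
Out[26]:
               event   user
0              start  user1
1       visited blog  user1
2          page view  user1
3       visited blog  user1
4   viewed blog post  user1
5    ad campaign hit  user1
6          page view  user1
7       visited site  user1
8       visited blog  user1
9   viewed blog post  user1
10      visited site  user1
11         page view  user1
12         signed up  user1
13  viewed blog post  user1
14         page view  user1
15      visited site  user1
16   ad campaign hit  user1
17  viewed blog post  user1
18      visited site  user1
19         page view  user1
20               end  user1

I have tried this in several ways: using np.where() to identify the correct rows, or

removal_starts_at = data[(data.user == 'user1') & (data.event == 'signed up')]
removal_ends_at = data[(data.user == 'user1') & (data.event == 'end')]
data[data.user == 'user1'].drop(data.index[removal_starts_at+1:removal_ends_at+1], inplace=True)

However, this is really slow! It takes about 20 seconds per user, and I have thousands of users, so it is not efficient. I would like to do this in a faster way if possible.

I discovered another problem while writing this question: if I do not include [data.user == 'user1'] to subset the data frame, it goes crazy and uses up all the memory on the computer. If I do include it, it does not actually subset the data, but gives me a warning about SettingWithCopy.

I am relatively new to Pandas, so it is quite likely that there is a much simpler way to do this and that I am just doing it completely wrong. One idea I have been toying with is to find the user and event combinations directly, or perhaps to subset in a more efficient way?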

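2 answers:

Answer 0 (score: 5)

If I understand correctly, the idea is that you have a large number of users in one data frame, so I expanded the example to have 2 users. If that is right, then something like this should be very fast: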

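import numpy as np
# keep = 1 from each 'start' row, 0 from the row after 'signed up', forward-filled in between
df['keep'] = np.where(df['event'] == 'start', 1, np.nan)
df['keep'] = np.where(df['event'].shift() == 'signed up', 0, df['keep'])
df['keep'] = df['keep'].ffill()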

               event   user  keep
0              start  user1     1
1       visited blog  user1     1
2          page view  user1     1
3          signed up  user1     1
4   viewed blog post  user1     0
5          page view  user1     0
6                end  user1     0
7              start  user2     1
8       visited blog  user2     1
9          signed up  user2     1
10  viewed blog post  user2     0
11               end  user2     0

df[df['keep']==1]

          event   user  keep
0         start  user1     1
1  visited blog  user1     1
2     page view  user1     1
3     signed up  user1     1
7         start  user2     1
8  visited blog  user2     1
9     signed up  user2     1
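
As a possible variant of the same idea, here is a minimal self-contained sketch that builds the mask with a grouped cumulative sum instead of a forward fill. The names session, is_signup and signups_before are hypothetical, and it assumes every flow begins with a start row, as described in the question. Flows that never reach signed up are kept whole.

```python
import pandas as pd

# Small two-user sample, shaped like the example in this answer
df = pd.DataFrame({
    'event': ['start', 'visited blog', 'page view', 'signed up',
              'viewed blog post', 'page view', 'end',
              'start', 'visited blog', 'signed up',
              'viewed blog post', 'end'],
    'user': ['user1'] * 7 + ['user2'] * 5,
})

# Hypothetical helper: a new session id begins at every 'start' row
session = (df['event'] == 'start').cumsum()

# 1 on 'signed up' rows, 0 elsewhere
is_signup = (df['event'] == 'signed up').astype(int)

# Number of 'signed up' rows strictly before each row, within its session
signups_before = is_signup.groupby(session).cumsum() - is_signup

# Keep everything up to and including the first 'signed up' of each session
result = df[signups_before == 0]
print(result)
```

Because the whole frame is processed with vectorized operations, there is no per-user Python loop, which is where the original attempt was spending its time.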

Answer 1 (score: 2)

I would just store the index I want and then slice from there.

In [15]: idx = data.query('user=="user1" and event=="signed up"').index[0]

In [16]: data[:idx+1]
Out[16]: 
               event   user
0              start  user1
1       visited blog  user1
2          page view  user1
3       visited blog  user1
4   viewed blog post  user1
5    ad campaign hit  user1
6          page view  user1
7       visited site  user1
8       visited blog  user1
9   viewed blog post  user1
10      visited site  user1
11         page view  user1
12         signed up  user1
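
Note that data[:idx+1] also discards every row after idx that belongs to other users. If the frame holds several users, a rough sketch along the following lines may be closer to what the question asks for. It assumes a default integer index and a single signed up event for the user; the names user_rows, signup_idx, end_idx and trimmed are made up for illustration.

```python
import pandas as pd

# Two-user sample; assumes the default 0..n-1 integer index
data = pd.DataFrame({
    'event': ['start', 'visited blog', 'page view', 'signed up',
              'viewed blog post', 'page view', 'end',
              'start', 'visited blog', 'signed up',
              'viewed blog post', 'end'],
    'user': ['user1'] * 7 + ['user2'] * 5,
})

# Rows belonging to the user of interest
user_rows = data[data['user'] == 'user1']

# Index label of that user's 'signed up' row
signup_idx = user_rows[user_rows['event'] == 'signed up'].index[0]

# Index label of the first 'end' row that follows it
ends = user_rows[user_rows['event'] == 'end'].index
end_idx = ends[ends > signup_idx][0]

# Drop the rows after 'signed up' up to and including 'end';
# drop() returns a new frame, so 'data' itself is left untouched
trimmed = data.drop(index=range(signup_idx + 1, end_idx + 1))
print(trimmed)
```

Assigning the result of drop() to a new name, rather than calling it with inplace=True on a filtered selection, avoids operating on a temporary copy, which is what the SettingWithCopy warning in the question points to.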