Question

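I'm curious whether anyone can solve this with pure pandas instead of the for loops I have been using. My current solution scales exponentially with the number of parameters I want to group by.

Initially I have a data frame that looks like this: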

        theday   device  event1  event2
0   2019-02-21  desktop       0       0
1   2019-02-22  desktop       1       1
2   2019-02-23  desktop       0       0
3   2019-02-24  desktop       1       1
4   2019-02-21    other       0       0
5   2019-02-22    other       1       1
6   2019-02-23    other       0       0
7   2019-02-24    other       1       1
8   2019-02-21  desktop       0       1
9   2019-02-22  desktop       1       0
10  2019-02-23    other       0       1
11  2019-02-24    other       1       0
12  2019-02-21  desktop       0       1
13  2019-02-22  desktop       1       0
14  2019-02-23    other       0       1
15  2019-02-24    other       1       0

You can generate the data frame with the following code:

import pandas as pd 
import numpy as np 
d = {'theday': ['2019-02-21','2019-02-22', '2019-02-23', '2019-02-24','2019-02-21','2019-02-22', '2019-02-23', '2019-02-24', '2019-02-21','2019-02-22', '2019-02-23', '2019-02-24', '2019-02-21','2019-02-22', '2019-02-23', '2019-02-24'], 'device': ['desktop', 'desktop','desktop','desktop', 'other','other','other','other', 'desktop','desktop', 'other','other', 'desktop','desktop', 'other','other' ], 'event1': [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1], 'event2': [0,1,0,1,0,1,0,1,1,0,1,0,1,0,1,0]} 
df = pd.DataFrame(data=d)

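Each row represents one user. For each given day and device, I want to compute the number of users who saw both event1 and event2, divided by the number of users who saw event1 (whether or not they also saw event2). As an equation: conversion = number_users_seen1_and_seen2 / number_users_seen1.

Having seen event1 and event2 means the user has a 1 in the event1 column and a 1 in the event2 column. Having seen event1 means the user has a 1 in the event1 column.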
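
To make the definition concrete, here is a minimal sketch that computes the ratio by hand for a single group: the three desktop users on 2019-02-22 (rows 1, 9 and 13 of the table above). The names seen1 and seen_both are only illustrative.

```python
import pandas as pd

# The three desktop users on 2019-02-22 (rows 1, 9 and 13 of the table above).
group = pd.DataFrame({"event1": [1, 1, 1], "event2": [1, 0, 0]})

seen1 = group["event1"] == 1                 # users who saw event1
seen_both = seen1 & (group["event2"] == 1)   # users who saw both events

# conversion = number_users_seen1_and_seen2 / number_users_seen1
conversion = seen_both.sum() / seen1.sum()
print(conversion)
```

Three users saw event1 and one of them also saw event2, so the ratio for this group is 1/3.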

My current solution is the following function:

def get_ratios(df, e1, e2):

    temp_list = []
    for device in df['device'].unique(): # iterate through devices
        for theday in df['theday'].unique(): # iterate through days
            current_df = df[(df['theday'] == theday) & (df['device'] == device)]
            if len(current_df[current_df[e1] == 1]) == 0: 
                conversion = 0 
            else: 
                conversion = len(current_df[(current_df[e1] == 1) & (current_df[e2] == 1)]) /len(current_df[current_df[e1] == 1]) 

            temp_dict = {"theday": theday, "device": device, "conversion": conversion}
            temp_list.append(temp_dict)

    return pd.DataFrame(temp_list)

If I call get_ratios(df, "event1", "event2"), I get:

   conversion   device      theday
0    0.000000  desktop  2019-02-21
1    0.333333  desktop  2019-02-22
2    0.000000  desktop  2019-02-23
3    1.000000  desktop  2019-02-24
4    0.000000    other  2019-02-21
5    1.000000    other  2019-02-22
6    0.000000    other  2019-02-23
7    0.333333    other  2019-02-24

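There are several problems with this approach:

(1) The function currently supports only theday and device; if I want to include more parameters, I have to modify the code.

(2) The running time scales very badly with the number of parameters I group by: each extra parameter adds another nested loop, so the number of iterations is the product of the unique values in every grouping column (exponential in the number of parameters).

(3) I am performing part of the logic outside pandas.

So my question is: can the same result be achieved without for loops, using only pandas functions?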

Answer 1

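It is not entirely clear to me how the conversion should be calculated, but you can change that part of this answer. I suggest using the apply function.

Step 1: Create the data frame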

import pandas as pd
import numpy as np
d = {'date': ['2019-02-21','2019-02-22', '2019-02-23', '2019-02-24','2019-02-21','2019-02-22', '2019-02-23', '2019-02-24'], 'device': ['desktop', 'desktop','desktop','desktop', 'other','other','other','other' ],
     'event1': [0,1,0,1,0,1,0,1], 'event2': [0,1,0,1,0,1,0,1]}
df = pd.DataFrame(data=d)

Step 2: Group by date and device

df2=df.groupby(['device','date']).sum()

Step 3: Calculate the conversion

# Guard against division by zero: a group with no event1 gets a ratio of 0.
df2['outcome'] = df2.apply(lambda x: 0 if x['event1'] == 0
                           else x['event2'] / x['event1'], axis=1)
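
The same step can probably be written without apply, by dividing the two summed columns directly. This is only a sketch under the same assumption as the steps above (the event columns are summed per group first); ratio is an illustrative name.

```python
import pandas as pd

d = {'date': ['2019-02-21', '2019-02-22', '2019-02-23', '2019-02-24',
              '2019-02-21', '2019-02-22', '2019-02-23', '2019-02-24'],
     'device': ['desktop', 'desktop', 'desktop', 'desktop',
                'other', 'other', 'other', 'other'],
     'event1': [0, 1, 0, 1, 0, 1, 0, 1],
     'event2': [0, 1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(data=d)

# Sum the event columns per (device, date) group, as in step 2.
df2 = df.groupby(['device', 'date']).sum()

# Column-wise division; groups where event1 sums to 0 produce NaN or inf,
# so those are replaced with 0.
ratio = df2['event2'] / df2['event1']
df2['outcome'] = ratio.where(df2['event1'] != 0, 0.0)
print(df2)
```

Series.where keeps the value where the condition holds and substitutes the second argument elsewhere, which plays the role of the if/else inside the lambda.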

Answer 2

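Although @Tox's answer works on the toy example I posted, it does not on a larger data set. The problem is that it groups before checking whether a given row contains both events. The following works: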

select 
  case when c.column1 is null or c.column2 is null then d.column3 else c.column1 end,
  case when c.column1 is null or c.column2 is null then d.column4 else c.column2 end
FROM table1 a 
JOIN table2 b ON a.id=b.id 
JOIN table3 c ON b.tabid = c.tabid
LEFT JOIN table4 d ON c.pmid=d.pmid 
WHERE a.id = @id

Pandas - how to group ratios based on conditions in multiple binary columns?
