Question

我有一个带有呼叫中心数据的熊猫数据框。数据框如下所示：

    member_id  survey_score  call_reason  call_direction      time_stamp
0     bob13         0          returns       inbound      2019-03-18 10:12:00
1     ub40          5         complaint      inbound      2019-03-19 11:12:00
2     bob13         7          returns       outbound     2019-03-19 09:15:00
3     todd100       3         order_error    inbound      2019-03-20 10:15:00
4     ub40          2         complaint      inbound      2019-03-21 12:11:00
5     todd100       7         order_error    outbound     2019-03-22 08:10:00
6     ub40          1         complaint      outbound     2019-03-22 11:09:00
7     ron34         6         exchange       inbound      2019-03-22 13:09:00
8     ron34         7         returns         inbound      2019-03-24 15:03:00

我正在寻找的输出如下：

    member_id    call_reason     score_differential          
0     bob13       returns               7
1     ub40       complaint             -1
2     todd100    order_error            4

因此，基本上，我希望得到一个成员的第一个呼入电话调查得分与该成员的下一个呼出电话调查得分之间的差异，并且前提是呼叫原因也相同。

作为一个小企业主，我正在尝试为公司自己做数据科学工作，以节省一些时间。不幸的是，我在这方面是个新手，对此我遇到了很大的困难。任何帮助将不胜感激！

注意：我正在通过anaconda在本地计算机上使用jupyter笔记本和熊猫。

请帮助我以更快，更轻松，更合乎逻辑的方式执行此操作。

我尝试了很多方法来使输出正确，但是我仍然遇到很大的困难。我觉得事情太复杂了。首先，我得到了电话订单。然后，我为第一得分入站呼叫得分和得分差异创建一些列。然后，我获得了要迭代的所有唯一成员Id的列表，最后，我用一堆逻辑弄成了一个巨大的循环，使我迷路了。

此外，在此代码的第一次迭代中，我没有考虑调用方向。此外，我得到了具有相同呼叫原因的成员的所有后续呼叫的平均值，然后得出了该成员与第一个呼叫之间的差额。我不再想要那样。

df['call_order'] = df_repeat.groupby('member_id')['timestamp'].rank(ascending=True, method = 'dense')

df["first_call_survey_score"] = ""
df["first_call_survey_score"] = np.nan
df["score_differential"] = ""
df["score_differential"] = np.nan

member_list = df['member_id'].unique()

unscorable = 0
for member in member_list:
    try:
        count = 2
        temp = df.loc[df['member_id'] == member]
        temp = temp.drop_duplicates(subset='call_order', keep="first")
        num_calls = temp['member_id'].count()
        first_call = temp.query("call_order == 1")
        first_survey_score = first_call['survey_score'].values[0]
        reason = first_call['call_reason'].values[0]
        sumscore = 0
        legit_call_count = 0
        while count <= num_calls:
                next_call = temp.query("call_order == @count")
                if reason == next_call['call_reason'].values[0]:
                    sumscore = sumscore + next_call['survey_score'].values[0]
                    count = count + 1
                    legit_call_count = legit_call_count + 1 
                elif reason != next_call['call_reason'].values[0] and count == num_calls:
                    count = 20
                elif reason != next_call['call_reason'].values[0]:
                    count = count + 1
                    next_call = temp.query("call_order == @count")
                    reason = next_call['call_reason'].values[0]
                    first_survey_score = next_call['survey_score'].values[0]
                else: count = count + 1

        if legit_call_count == 1:
            df.loc[((df_repeat['member_id'] == member)),['score_differential']] = sumscore / legit_call_count - first_survey_score
        elif count == 20:unscorable = unscorable + 1
        else: 
            df.loc[((df['member_id'] == member)),['score_differential']] = sumscore / legit_call_count - first_survey_score
    except Exception as exception:
            unscorable = unscorable + 1

print(unscorable, "Callers could not be scored")

Answer 1

这是一种方法，其中传出的呼叫由成员/原因赋予唯一的ID，然后该ID回填到传入的呼叫中。然后，将给定（成员，原因，Id）的最后一个传入呼叫与相同（成员，原因，Id）的传出呼叫配对，并计算出差值。注意：我为用户bob13添加了第二个呼叫序列，以表明它可以处理同一用户的多个呼叫。

txt = """\
   member_id  survey_score  call_reason  call_direction      time_stamp
     bob13         0          returns       inbound      2019-03-18T10:12:00
     ub40          5         complaint      inbound      2019-03-19T11:12:00
     bob13         7          returns       outbound     2019-03-19T09:15:00
     todd100       3         order_error    inbound      2019-03-20T10:15:00
     ub40          2         complaint      inbound      2019-03-21T12:11:00
     todd100       7         order_error    outbound     2019-03-22T08:10:00
     ub40          1         complaint      outbound     2019-03-22T11:09:00
     ron34         6         exchange       inbound      2019-03-22T13:09:00
     ron34         7         returns         inbound      2019-03-24T15:03:00
     bob13         2          returns       inbound      2019-03-25T10:12:00
     bob13         3          returns       outbound     2019-03-27T09:15:00
"""
df = pd.read_csv(io.StringIO(txt), delim_whitespace=1, index_col=False)

grp = df.query('call_direction=="outbound"').\
    groupby(['member_id', 'call_reason'])
df['OutId'] = grp.time_stamp.transform(lambda x: x.rank())
print()
print(df)

grp = df.groupby(['member_id', 'call_reason'])
df['Id'] = grp.OutId.transform(lambda x: x.bfill())
print()
print(df)

inbnd_score = df.query('call_direction=="inbound"').\
    groupby(['member_id', 'call_reason', 'Id']).survey_score.last()
outbnd_score = df.query('call_direction=="outbound"').\
    groupby(['member_id', 'call_reason', 'Id']).survey_score.last()

ddf = pd.concat([inbnd_score, outbnd_score], axis=1,
                keys=['inbnd', 'outbnd'])
ddf['score_differential'] = ddf.outbnd - ddf.inbnd
print()
print(ddf)

输出：

   member_id  survey_score  call_reason call_direction           time_stamp  OutId
0      bob13             0      returns        inbound  2019-03-18T10:12:00    NaN
1       ub40             5    complaint        inbound  2019-03-19T11:12:00    NaN
2      bob13             7      returns       outbound  2019-03-19T09:15:00    1.0
3    todd100             3  order_error        inbound  2019-03-20T10:15:00    NaN
4       ub40             2    complaint        inbound  2019-03-21T12:11:00    NaN
5    todd100             7  order_error       outbound  2019-03-22T08:10:00    1.0
6       ub40             1    complaint       outbound  2019-03-22T11:09:00    1.0
7      ron34             6     exchange        inbound  2019-03-22T13:09:00    NaN
8      ron34             7      returns        inbound  2019-03-24T15:03:00    NaN
9      bob13             2      returns        inbound  2019-03-25T10:12:00    NaN
10     bob13             3      returns       outbound  2019-03-27T09:15:00    2.0

   member_id  survey_score  call_reason call_direction           time_stamp  OutId   Id
0      bob13             0      returns        inbound  2019-03-18T10:12:00    NaN  1.0
1       ub40             5    complaint        inbound  2019-03-19T11:12:00    NaN  1.0
2      bob13             7      returns       outbound  2019-03-19T09:15:00    1.0  1.0
3    todd100             3  order_error        inbound  2019-03-20T10:15:00    NaN  1.0
4       ub40             2    complaint        inbound  2019-03-21T12:11:00    NaN  1.0
5    todd100             7  order_error       outbound  2019-03-22T08:10:00    1.0  1.0
6       ub40             1    complaint       outbound  2019-03-22T11:09:00    1.0  1.0
7      ron34             6     exchange        inbound  2019-03-22T13:09:00    NaN  NaN
8      ron34             7      returns        inbound  2019-03-24T15:03:00    NaN  NaN
9      bob13             2      returns        inbound  2019-03-25T10:12:00    NaN  2.0
10     bob13             3      returns       outbound  2019-03-27T09:15:00    2.0  2.0

                           inbnd  outbnd  score_differential
member_id call_reason Id
bob13     returns     1.0      0       7                   7
                      2.0      2       3                   1
todd100   order_error 1.0      3       7                   4
ub40      complaint   1.0      2       1                  -1

调查某个呼叫类别的第一个呼叫与该呼叫类别的后续呼叫之间的分数差异

1 个答案: