How to reclassify records from two categories into four categories

Asked: 2016-11-19 11:06:37

Tags: python algorithm dataframe subset

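I have a pandas DataFrame covering millions of customers, with one column per product [a,b,c,d,e,f,j,h,i,j,k,l]. For each product, the data records whether the customer used the product in a given month (1) or did not use it (0).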

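Original classification for each customer: 1 means used, 0 means not used.
I want to reclassify product usage into four categories:

S: used    M: maintained use (used again in the following months)
N: not used    D: maintained non-use (not used for consecutive months)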

The raw data looks like this:

+-------------+-------+---+---+---+---+---+---+---+---+---+---+---+---+
| Customer_ID | Month | a | b | c | d | e | f | j | h | i | j | k | l |
+-------------+-------+---+---+---+---+---+---+---+---+---+---+---+---+
| 19509       | Jan   | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19509       | Feb   | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 19509       | Mar   | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19509       | Apr   | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19509       | May   | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19509       | Jun   | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19509       | Jul   | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19509       | Aug   | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19509       | Sep   | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 19510       | Jan   | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19510       | Feb   | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 19510       | Mar   | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19510       | Apr   | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19510       | May   | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19510       | Jun   | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19510       | Jul   | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19510       | Aug   | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19510       | Sep   | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 19511       | Jan   | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19511       | Feb   | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 19511       | Mar   | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19511       | Apr   | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19511       | May   | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19511       | Jun   | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 19511       | Jul   | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 19511       | Aug   | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19511       | Sep   | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
+-------------+-------+---+---+---+---+---+---+---+---+---+---+---+---+

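I want to reclassify customers into four categories, to account for those who keep using, or keep not using, a product for several months.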

The result should look like this:

+-------------+-------+---+---+---+---+---+---+---+---+---+---+---+---+
| Customer_ID | Month | a | b | c | d | e | f | j | h | i | j | k | l |
+-------------+-------+---+---+---+---+---+---+---+---+---+---+---+---+
| 19509       | Jan   | S | N | S | N | N | S | N | S | N | S | S | N |
| 19509       | Feb   | M | N | N | D | D | M | D | M | D | N | M | D |
| 19509       | Mar   | M | S | S | D | D | M | D | M | D | S | M | D |
| 19509       | Apr   | N | M | N | S | D | M | D | M | D | N | N | D |
| 19509       | May   | D | N | D | M | S | M | D | M | D | D | D | D |
| 19509       | Jun   | D | D | D | M | N | M | D | M | D | D | D | D |
| 19509       | Jul   | S | S | S | N | D | M | D | M | D | S | S | D |
| 19509       | Aug   | N | M | N | D | D | M | D | N | D | N | N | D |
| 19509       | Sep   | S | M | S | S | D | M | D | D | S | S | S | D |
| 19510       | Jan   | S | N | S | N | N | S | N | S | N | S | S | N |
| 19510       | Feb   | M | N | N | D | D | M | D | M | D | N | M | D |
| 19510       | Mar   | M | S | S | D | D | M | D | M | D | S | M | D |
| 19510       | Apr   | N | M | N | S | D | M | D | M | D | N | N | D |
| 19510       | May   | D | N | D | M | S | M | D | M | D | D | D | D |
| 19510       | Jun   | D | D | D | M | N | M | D | M | D | D | D | D |
| 19510       | Jul   | S | S | S | N | D | M | D | M | D | S | S | D |
| 19510       | Aug   | N | M | N | D | D | M | D | N | D | N | N | D |
| 19510       | Sep   | S | M | S | S | D | M | D | D | S | S | S | D |
| 19511       | Jan   | S | N | S | N | N | S | N | S | N | S | S | N |
| 19511       | Feb   | M | N | N | D | D | M | D | M | D | N | M | D |
| 19511       | Mar   | M | S | S | D | D | M | D | M | D | S | M | D |
| 19511       | Apr   | N | M | N | S | D | M | D | M | D | N | N | D |
| 19511       | May   | D | N | D | M | S | M | D | M | D | D | D | D |
| 19511       | Jun   | D | D | D | M | N | M | D | M | D | D | D | D |
| 19511       | Jul   | S | S | S | N | D | M | D | M | D | S | S | D |
| 19511       | Aug   | N | M | N | D | D | M | D | N | D | N | N | D |
| 19511       | Sep   | S | M | S | S | D | M | D | D | S | S | S | D |
+-------------+-------+---+---+---+---+---+---+---+---+---+---+---+---+

The algorithm for this seems complicated, and I am still working out the right sequence of steps.

I want to do this for all customers and all products (columns). I think we could start like this:

for i in df["Customer_ID"].unique():
    for j in df.columns.drop(["Customer_ID", "Month"]):
        ...  # reclassify product column j for customer i

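Note: this is not really a used/not-used situation but rather join (1), cancel (0), stay idle (0), join again (1), and so on. A zero means the customer cancelled the service; zeros in the following three months mean he is not a customer; then he joins and cancels again, and we should know how many times he cancelled. If we only compute totals, that does not tell us how many times the customer joined or how many times he cancelled a particular product or service.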
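One way to read the note above is that a join is a change from 0 to 1 (or a 1 in the first recorded month) and a cancel is a change from 1 to 0. The following is a minimal sketch under that assumption; the toy frame and the helper name `count_events` are hypothetical, and rows are assumed to be in chronological order within each customer.

```python
import pandas as pd

# Hypothetical toy data: two customers, one product column "a".
df = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 1, 1, 2, 2, 2],
    "a": [1, 0, 0, 1, 0, 0, 1, 1],
})


def count_events(df, col, key="Customer_ID"):
    """Count 0->1 (join) and 1->0 (cancel) changes per customer for one product."""
    cur = df[col]
    # Previous month's value within the same customer.
    # Assumption: "before the first month" counts as 0 (not a customer).
    prev = df.groupby(key)[col].shift().fillna(0)
    joins = ((cur == 1) & (prev == 0)).groupby(df[key]).sum()
    cancels = ((cur == 0) & (prev == 1)).groupby(df[key]).sum()
    return pd.DataFrame({"joins": joins, "cancels": cancels})


events = count_events(df, "a")
print(events)
```

Unlike a plain column total, this keeps joins and cancels separate, so a customer who joined twice and cancelled twice is distinguishable from one who simply used the product for two months.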

I would appreciate any suggestions or ideas for solving this.

2 answers:

Answer 0 (score: 0)

To keep it simple, I explain how to do this for one customer and one product; you can then repeat it for every customer and column:

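  1. Find the earliest entry (if you are doing this in November, you would look at the value for December first, then January, February, and so on, until you find a value) and apply the new value:

    • 0 => N
    • 1 => S
  2. For the following (at most 11) entries, apply a value based on the previous state and on the value in the column, written here as f(old, val):

    • f(N,0) => D
    • f(N,1) => S
    • f(S,0) => N
    • f(S,1) => M
    • f(M,0) => N
    • f(M,1) => M
    • f(D,0) => D
    • f(D,1) => S
  3. In this case it can be simplified (N/D and S/M lead to the same results, so you only need to look at the previous value rather than the previous state), but with more complex state transitions that may not be possible, so I wrote it out in full to show the idea.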
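The steps above can be sketched in pandas as follows. This is a minimal illustration, not a tuned implementation: the toy frame, the names `reclassify` and `reclassify_vectorized`, and the assumption that rows are already in chronological order within each customer are all mine. The first function follows the transition table literally; the second uses the simplification from step 3 and only compares each value with the previous month's value.

```python
import numpy as np
import pandas as pd

# Transition table from step 2: (previous state, current 0/1 value) -> new state.
TRANSITIONS = {
    ("N", 0): "D", ("N", 1): "S",
    ("S", 0): "N", ("S", 1): "M",
    ("M", 0): "N", ("M", 1): "M",
    ("D", 0): "D", ("D", 1): "S",
}
FIRST = {0: "N", 1: "S"}  # step 1: state for the earliest entry


def reclassify(values):
    """Map a chronologically ordered 0/1 sequence to S/M/N/D labels."""
    states = []
    for val in values:
        val = int(val)
        state = TRANSITIONS[(states[-1], val)] if states else FIRST[val]
        states.append(state)
    return states


def reclassify_vectorized(df, col, key="Customer_ID"):
    """Step 3 shortcut: decide from the current and previous value only."""
    cur = df[col]
    prev = df.groupby(key)[col].shift()  # NaN in each customer's first row
    labels = np.select(
        [(cur == 1) & (prev == 1), cur == 1, (cur == 0) & (prev == 0)],
        ["M", "S", "D"],
        default="N",
    )
    return pd.Series(labels, index=df.index)


# Hypothetical toy frame: two customers, two products, rows in month order.
df = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 1, 1, 2, 2, 2],
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jan", "Feb", "Mar"],
    "a": [1, 1, 0, 0, 1, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 1, 0, 0],
})
products = ["a", "b"]

looped = df.copy()
fast = df.copy()
for col in products:
    # Run the state machine separately on each customer's rows.
    looped[col] = df.groupby("Customer_ID")[col].transform(
        lambda s: pd.Series(reclassify(s), index=s.index)
    )
    fast[col] = reclassify_vectorized(df, col)

print(looped)
```

For millions of customers the vectorized variant is likely the more practical one, since it avoids a Python-level loop over every row; the explicit table is easier to extend if the states become more complex.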

Answer 1 (score: 0)

Hints:

Prefix sum

  • If it increases - used
  • If it keeps increasing over a longer period, or the sum over, say, 12 months exceeds a threshold - maintained use

You can work out the rest.
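A minimal sketch of the prefix-sum hint, assuming rows are in chronological order within each customer. The toy frame and the values of `WINDOW` and `THRESHOLD` are hypothetical; the hint itself does not specify them.

```python
import pandas as pd

# Hypothetical toy data: one customer, one product column "a".
df = pd.DataFrame({
    "Customer_ID": [1] * 6,
    "a": [1, 1, 0, 1, 1, 1],
})

WINDOW = 3      # hypothetical window length in months
THRESHOLD = 3   # hypothetical number of months of use required in the window

# Running total of usage within each customer.
prefix = df.groupby("Customer_ID")["a"].cumsum()

# "Used" this month: the prefix sum went up compared with the previous month.
prev_prefix = prefix.groupby(df["Customer_ID"]).shift(1).fillna(0)
used = prefix > prev_prefix

# Usage in the last WINDOW months is prefix[i] - prefix[i - WINDOW].
lagged = prefix.groupby(df["Customer_ID"]).shift(WINDOW).fillna(0)
window_sum = prefix - lagged
maintained = window_sum >= THRESHOLD

print(pd.DataFrame({"prefix": prefix, "used": used, "maintained": maintained}))
```

The difference of two prefix sums gives the usage count over any window without re-adding the individual months, which is what makes the threshold check cheap.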

Kadane's algorithm - maximum subarray - if you mark use as +1 and non-use as -1, this will tell you the stretch of months over which use most outweighs non-use.
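A minimal sketch of that idea for a single customer and product. The function name `best_usage_stretch` is hypothetical. Note that the maximum-sum stretch is the one where uses exceed non-uses by the largest margin, which is not necessarily the longest stretch.

```python
def best_usage_stretch(usage):
    """Kadane's algorithm on a 0/1 sequence scored as +1 (use) / -1 (non-use).

    Returns (score, start, end) with inclusive indices, or (None, 0, 0)
    for an empty sequence.
    """
    best_sum = None
    best_start = best_end = 0
    cur_sum = 0
    cur_start = 0
    for i, used in enumerate(usage):
        step = 1 if used else -1
        if cur_sum <= 0:
            # A non-positive running sum cannot help; start a new stretch here.
            cur_sum = step
            cur_start = i
        else:
            cur_sum += step
        if best_sum is None or cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_sum, best_start, best_end


# Column "a" for customer 19509 in the question, Jan through Sep.
print(best_usage_stretch([1, 1, 1, 0, 1, 1, 1, 0, 1]))
```

For that column the best stretch runs from January (index 0) to July (index 6) with a score of 5, because the single non-use month in April is outweighed by the use months around it.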