Question

I have a dataset like the one below:

ID. Invoice. Date of Invoice.  paid or not.  

1    1         10/31/2019       yes
1    1         10/31/2019       yes
1    2         11/30/2019       no
1    3         12/31/2019       no

2    1         09/30/2019       no
2    2         10/30/2019       no
2    3         11/30/2019       yes

3    1         7/31/2019        no
3    2         9/30/2019        yes
3    3         12/31/2019       no

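I want to know whether a customer is willing to pay. As long as a customer has paid a newer invoice, even if older invoices are still unpaid, I give that customer a good score. So customers 2 and 3 are rated "good", and customer 1 is rated "bad".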

The final data should therefore have one more column, with the values good and bad.

ID. Invoice. Date of Invoice.  paid or not.  Bad or good

1    1         10/31/2019       yes          bad
1    1         10/31/2019       yes          bad
1    2         11/30/2019       no           bad
1    3         12/31/2019       no           bad

2    1         09/30/2019       no           good
2    2         10/30/2019       no           good
2    3         11/30/2019       yes          good

3    1         7/31/2019        no           good
3    2         9/30/2019        yes          good
3    3         12/31/2019       no           good

Answer 1

Your data:

df = structure(list(ID. = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Invoice. = c(1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Date.of.Invoice. = structure(c(3L, 
3L, 4L, 5L, 1L, 2L, 4L, 6L, 7L, 5L), .Label = c("09/30/2019", 
"10/30/2019", "10/31/2019", "11/30/2019", "12/31/2019", "7/31/2019", 
"9/30/2019"), class = "factor"), paid.or.not. = structure(c(2L, 
2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))

You can try the following:

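label_func = function(i){
  # i holds the numeric factor codes: 1 = "no", 2 = "yes"
  if (all(i==2)) {
    "good"   # every invoice was paid
  } else if (any(diff(i)>0)) {
    "good"   # at least one "yes" directly follows a "no"
  } else {
    "bad"
  }
}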

library(dplyr)
df$paid.or.not. = factor(df$paid.or.not.,levels=c("no","yes"))
df %>% group_by(ID.) %>% 
mutate(score=label_func(as.numeric(paid.or.not.)))

# A tibble: 10 x 5
# Groups:   ID. [3]
     ID. Invoice. Date.of.Invoice. paid.or.not. score
   <int>    <int> <fct>            <fct>        <chr>
 1     1        1 10/31/2019       yes          bad  
 2     1        1 10/31/2019       yes          bad  
 3     1        2 11/30/2019       no           bad  
 4     1        3 12/31/2019       no           bad  
 5     2        1 09/30/2019       no           good 
 6     2        2 10/30/2019       no           good 
 7     2        3 11/30/2019       yes          good 
 8     3        1 7/31/2019        no           good 
 9     3        2 9/30/2019        yes          good 
10     3        3 12/31/2019       no           good

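To explain how this works: the paid.or.not. column in your data frame is usually encoded as a factor. In the code above I set this explicitly, with "no" as the first level and "yes" as the second. If we apply as.numeric() to this column: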

df %>% mutate(score=as.numeric(paid.or.not.))
   ID. Invoice. Date.of.Invoice. paid.or.not. score
1    1        1       10/31/2019          yes     2
2    1        1       10/31/2019          yes     2
3    1        2       11/30/2019           no     1
4    1        3       12/31/2019           no     1
5    2        1       09/30/2019           no     1
6    2        2       10/30/2019           no     1
7    2        3       11/30/2019          yes     2
8    3        1        7/31/2019           no     1
9    3        2        9/30/2019          yes     2
10   3        3       12/31/2019           no     1

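We can see that it becomes 1 or 2. You label a customer good if a "yes" comes after a "no", which means the difference between consecutive values is +1.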

We can look at it like this:

df %>% mutate(score=as.numeric(paid.or.not.)-lag(as.numeric(paid.or.not.)))

   ID. Invoice. Date.of.Invoice. paid.or.not. score
1    1        1       10/31/2019          yes    NA
2    1        1       10/31/2019          yes     0
3    1        2       11/30/2019           no    -1
4    1        3       12/31/2019           no     0
5    2        1       09/30/2019           no     0
6    2        2       10/30/2019           no     0
7    2        3       11/30/2019          yes     1
8    3        1        7/31/2019           no    -1
9    3        2        9/30/2019          yes     1
10   3        3       12/31/2019           no    -1
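Note that the lag() call above is not grouped, so the first row of customers 2 and 3 is compared with the last row of the previous customer. The following is a minimal sketch of the same difference computed within each customer; the data frame `pay` and the column name `change` are illustrative names, not part of the original code.

```r
library(dplyr)

# Same payment pattern as the question's data, in the original row order.
pay <- data.frame(
  ID. = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  paid.or.not. = factor(
    c("yes", "yes", "no", "no", "no", "no", "yes", "no", "yes", "no"),
    levels = c("no", "yes")
  )
)

# Difference to the previous invoice, restarted for every customer.
per_customer <- pay %>%
  group_by(ID.) %>%
  mutate(change = as.numeric(paid.or.not.) -
           dplyr::lag(as.numeric(paid.or.not.))) %>%
  ungroup()

print(per_customer)
```

With grouping, the first invoice of every customer gets NA, and a +1 can only come from a "no" followed by a "yes" for the same customer.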

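You can see that the customers to be labeled "good" have at least one +1, while the "bad" ones have none. The remaining special cases are customers whose invoices are all "yes" or all "no":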

test=data.frame(ID.=1:2,Invoice.=1,
Date.of.Invoice.="12/31/2019",paid.or.not.=c("yes","no"))
test$paid.or.not. = factor(test$paid.or.not.,levels=c("no","yes"))
test %>% group_by(ID.) %>% 
mutate(score=label_func(as.numeric(paid.or.not.)))

# A tibble: 2 x 5
# Groups:   ID. [2]
    ID. Invoice. Date.of.Invoice. paid.or.not. score
  <int>    <dbl> <fct>            <fct>        <chr>
1     1        1 12/31/2019       yes          good 
2     2        1 12/31/2019       no           bad
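The approach above relies on the rows of each customer already being in chronological order. If that cannot be guaranteed, one option is to parse the dates and sort before labeling. This is a sketch under the assumption that all dates use the month/day/year format seen in the question; `inv_date`, `paid_num` and `scored` are illustrative names, and `label_func` is restated in compact form so the block runs on its own.

```r
library(dplyr)

# The question's data, rebuilt with plain character columns.
df <- data.frame(
  ID. = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  Date.of.Invoice. = c("10/31/2019", "10/31/2019", "11/30/2019", "12/31/2019",
                       "09/30/2019", "10/30/2019", "11/30/2019",
                       "7/31/2019", "9/30/2019", "12/31/2019"),
  paid.or.not. = c("yes", "yes", "no", "no", "no", "no", "yes",
                   "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Compact restatement of the labeling rule: 1 = "no", 2 = "yes".
label_func <- function(i) {
  if (all(i == 2) || any(diff(i) > 0)) "good" else "bad"
}

scored <- df %>%
  mutate(
    # Assumed date format: month/day/year.
    inv_date = as.Date(Date.of.Invoice., format = "%m/%d/%Y"),
    paid_num = as.numeric(factor(paid.or.not., levels = c("no", "yes")))
  ) %>%
  arrange(ID., inv_date) %>%   # chronological order within each customer
  group_by(ID.) %>%
  mutate(score = label_func(paid_num)) %>%
  ungroup()

print(scored)
```

Sorting on a real Date column rather than on the text avoids ordering problems such as "7/31/2019" sorting after "10/31/2019" as a string.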
