I have a dataset like the following:
ID. Invoice. Date of Invoice. paid or not.
1 1 10/31/2019 yes
1 1 10/31/2019 yes
1 2 11/30/2019 no
1 3 12/31/2019 no
2 1 09/30/2019 no
2 2 10/30/2019 no
2 3 11/30/2019 yes
3 1 7/31/2019 no
3 2 9/30/2019 yes
3 3 12/31/2019 no
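I want to know whether a customer is willing to pay. As long as a customer has paid a newer invoice while an older invoice is still unpaid, I give that customer a good score. So customers 2 and 3 are rated "good", and customer 1 is rated "bad".
The final data should therefore have one more column, with the values good and bad.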
ID. Invoice. Date of Invoice. paid or not. Bad or good
1 1 10/31/2019 yes bad
1 1 10/31/2019 yes bad
1 2 11/30/2019 no bad
1 3 12/31/2019 no bad
2 1 09/30/2019 no good
2 2 10/30/2019 no good
2 3 11/30/2019 yes good
3 1 7/31/2019 no good
3 2 9/30/2019 yes good
3 3 12/31/2019 no good
Answer 0 (score: 0)
Your data:
df = structure(list(ID. = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Invoice. = c(1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Date.of.Invoice. = structure(c(3L,
3L, 4L, 5L, 1L, 2L, 4L, 6L, 7L, 5L), .Label = c("09/30/2019",
"10/30/2019", "10/31/2019", "11/30/2019", "12/31/2019", "7/31/2019",
"9/30/2019"), class = "factor"), paid.or.not. = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
You can try the following:
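# i holds the factor codes for one customer, in row order:
# 1 = "no", 2 = "yes"
label_func = function(i){
  if (all(i==2)) {
    # every invoice is paid
    "good"
  } else if (any(diff(i)>0)) {
    # at least one "no" is directly followed by a "yes"
    "good"
  } else {
    "bad"
  }
}
library(dplyr)
# fix the level order so that "no" is coded 1 and "yes" is coded 2
df$paid.or.not. = factor(df$paid.or.not.,levels=c("no","yes"))
df %>% group_by(ID.) %>%
  mutate(score=label_func(as.numeric(paid.or.not.)))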
# A tibble: 10 x 5
# Groups: ID. [3]
ID. Invoice. Date.of.Invoice. paid.or.not. score
<int> <int> <fct> <fct> <chr>
1 1 1 10/31/2019 yes bad
2 1 1 10/31/2019 yes bad
3 1 2 11/30/2019 no bad
4 1 3 12/31/2019 no bad
5 2 1 09/30/2019 no good
6 2 2 10/30/2019 no good
7 2 3 11/30/2019 yes good
8 3 1 7/31/2019 no good
9 3 2 9/30/2019 yes good
10 3 3 12/31/2019 no good
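The rule looks at consecutive rows, so it gives the intended label only if each customer's invoices are in date order. The example data happens to be sorted already. Below is a minimal base R sketch for the case where it is not, assuming the dates are month/day/year strings as in the question. The data frame `df2` and its column names are made up for illustration.

```r
# Hypothetical data: two customers with rows deliberately out of date order.
df2 <- data.frame(
  ID = c(2, 2, 2, 1, 1, 1),
  date = c("11/30/2019", "09/30/2019", "10/30/2019",
           "12/31/2019", "10/31/2019", "11/30/2019"),
  paid = c("yes", "no", "no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Same rule as label_func in the answer, on codes 1 = "no", 2 = "yes".
label_func <- function(i) {
  if (all(i == 2)) "good" else if (any(diff(i) > 0)) "good" else "bad"
}

# Parse the strings as dates so they sort chronologically rather than as text.
df2$date_parsed <- as.Date(df2$date, format = "%m/%d/%Y")
df2 <- df2[order(df2$ID, df2$date_parsed), ]

# Encode paid status, compute one label per ID, and map it back onto the rows.
codes <- as.numeric(factor(df2$paid, levels = c("no", "yes")))
labels <- tapply(codes, df2$ID, label_func)
df2$score <- as.vector(labels[as.character(df2$ID)])
```

In the original row order, customer 1 in this made-up data would show a "no" followed by a "yes" and be labeled good. After sorting by date the sequence is yes, no, no, which is labeled bad.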
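To explain how label_func works: in your data frame, the paid.or.not. column is typically coded as a factor. In the code above I enforce this and set "no" as the first level and "yes" as the second. If we apply as.numeric() to this column: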
df %>% mutate(score=as.numeric(paid.or.not.))
ID. Invoice. Date.of.Invoice. paid.or.not. score
1 1 1 10/31/2019 yes 2
2 1 1 10/31/2019 yes 2
3 1 2 11/30/2019 no 1
4 1 3 12/31/2019 no 1
5 2 1 09/30/2019 no 1
6 2 2 10/30/2019 no 1
7 2 3 11/30/2019 yes 2
8 3 1 7/31/2019 no 1
9 3 2 9/30/2019 yes 2
10 3 3 12/31/2019 no 1
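The codes returned by as.numeric() follow the order of the factor levels, which is why the levels are set explicitly. A small sketch on a made-up vector (`paid` is not part of the original data):

```r
paid <- c("yes", "no", "yes")

# With levels no, yes: "no" is coded 1 and "yes" is coded 2.
codes_no_first <- as.numeric(factor(paid, levels = c("no", "yes")))
codes_no_first
# 2 1 2

# Reversing the levels flips the codes, which would flip the sign of diff().
codes_yes_first <- as.numeric(factor(paid, levels = c("yes", "no")))
codes_yes_first
# 1 2 1
```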
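We can see that the column now takes the value 1 ("no") or 2 ("yes"). A customer is labeled good if a "yes" comes right after a "no", which means the difference between two consecutive values is +1.
We can look at it like this: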
df %>% mutate(score=as.numeric(paid.or.not.)-lag(as.numeric(paid.or.not.)))
ID. Invoice. Date.of.Invoice. paid.or.not. score
1 1 1 10/31/2019 yes NA
2 1 1 10/31/2019 yes 0
3 1 2 11/30/2019 no -1
4 1 3 12/31/2019 no 0
5 2 1 09/30/2019 no 0
6 2 2 10/30/2019 no 0
7 2 3 11/30/2019 yes 1
8 3 1 7/31/2019 no -1
9 3 2 9/30/2019 yes 1
10 3 3 12/31/2019 no -1
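Note that the lag() call above is not grouped, so the first row of each ID is compared with the last row of the previous customer (rows 5 and 8). Inside label_func, diff() only sees one customer at a time. A base R sketch of the per-customer differences, with the codes typed in by hand from the table above:

```r
# Codes from the example: 1 = "no", 2 = "yes", in the row order shown above.
id    <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
codes <- c(2, 2, 1, 1, 1, 1, 2, 1, 2, 1)

# diff() within each ID, so no comparison crosses a customer boundary.
steps <- tapply(codes, id, diff)
# ID 1:  0 -1  0
# ID 2:  0  1
# ID 3:  1 -1

# An ID has a "no" directly followed by a "yes" when some step is positive.
has_rise <- sapply(steps, function(d) any(d > 0))
```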
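You can see that the customers to be labeled "good" have at least one +1, while the "bad" one has no +1. The last special case is a customer whose invoices are all "yes" or all "no", where there is no change at all. All "yes" is labeled good and all "no" is labeled bad: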
test=data.frame(ID.=1:2,Invoice.=1,
Date.of.Invoice.="12/31/2019",paid.or.not.=c("yes","no"))
test$paid.or.not. = factor(test$paid.or.not.,levels=c("no","yes"))
test %>% group_by(ID.) %>%
mutate(score=label_func(as.numeric(paid.or.not.)))
# A tibble: 2 x 5
# Groups: ID. [2]
ID. Invoice. Date.of.Invoice. paid.or.not. score
<int> <dbl> <fct> <fct> <chr>
1 1 1 12/31/2019 yes good
2 2 1 12/31/2019 no bad
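The same rule can also be written directly on TRUE/FALSE values instead of factor codes. This is only an alternative sketch, not the code from the answer: `label_paid` is a hypothetical helper, and it assumes `paid` is a logical vector already in date order.

```r
# Hypothetical helper: paid is a logical vector for one customer, in date order.
label_paid <- function(paid) {
  # TRUE if some paid invoice comes after at least one unpaid invoice.
  paid_after_unpaid <- any(paid & cumsum(!paid) > 0)
  if (all(paid) || paid_after_unpaid) "good" else "bad"
}

label_paid(c(TRUE, TRUE, FALSE, FALSE))  # customer 1: "bad"
label_paid(c(FALSE, FALSE, TRUE))        # customer 2: "good"
label_paid(c(FALSE, TRUE, FALSE))        # customer 3: "good"
label_paid(TRUE)                         # single paid invoice: "good"
label_paid(FALSE)                        # single unpaid invoice: "bad"
```

With the data frame from the answer, the logical input could be obtained as `df$paid.or.not. == "yes"` within each ID.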