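I have been searching all over SO and other data science and programming blogs, but I have not found an answer that meets my specific needs. So if you find this question to be a duplicate, please be kind enough to point me to the source and close/delete the question.
My real data has thousands of rows, so here I show only a small set of made-up data that closely resembles my original data: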
Data <- data.frame(
ID = c(1,1,1,1,2,2,2,2,3,3),
Year = c(2014,2015,2016,2017,2007,2008,2009,2010,2016,2017),
CmSm = c(1,2,1,0,1,0,0,1,1,0),
Index = c(1,2,3,4,1,2,3,4,1,2)
)
The dataset I would like to end up with is:
Dataout <- data.frame(
ID = c(1,1,1,1,2,2,2,2,3,3),
Year = c(2014,2015,2016,2017,2007,2008, 2009,2010,2016,2017),
CmSm = c(1,2,1,0,1,0,0,1,1,0),
Index = c(1,2,3,4,1,2,3,4,1,2),
Cassification = c("New", "Existing", "Existing", "Lost", "New", "Lost","","Returning", "New","Lost")
)
My best attempt so far is:
Dataout$Status <- ave( Dataout$CmSm,
Dataout$ID,
FUN = function(x) ifelse( Dataout$Index == 1, "New", ifelse( x[-1] == 0 & x > 0, "Returning", ifelse( x[-1] == 0 & x == 0, "", ifelse( x[-1] > 0 & x == 0, "Lost", "Existing" ) ) ) ) )
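But this attempt has two problems:
The classification is wrong;
When I run this code on my original data with thousands of rows, R computes for 15 minutes and returns no result (I guess ifelse doesn't help...), not to mention that the memory allocated to the process is very high.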
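One likely contributor to the wrong classification, offered as a guess rather than a full diagnosis: x[-1] is x with its first element removed, not x shifted by one position. It is one element shorter than x and lines up with the following rows rather than the preceding ones. A minimal base R sketch on the CmSm values for ID 2 (the names x, dropped and prev are introduced here for illustration):

```r
# CmSm values for ID 2 in the toy data.
x <- c(1, 0, 0, 1)

# Negative indexing removes the first element; the result has length 3.
dropped <- x[-1]

# A "previous value" vector keeps the length of x and pads the front with NA.
prev <- c(NA, head(x, -1))

print(dropped)
print(prev)
```

In addition, Dataout$Index == 1 inside FUN refers to the full column rather than to the current group's rows, so the test vector and x have different lengths.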
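Explanation of the problem and the classification rules:
Given a list of item IDs, years, and an index per item ID, I want to classify these items into the following categories: "New", "Existing", "Returning", "Lost", and "" (or null, or not applicable). The rules for this classification are as follows (CmSm-1 denotes the immediately preceding value relative to the current CmSm value):
If Index == 1, then "New".
If Index > 1, then:
If CmSm-1 == 0 and CmSm > 0, then "Returning".
If CmSm-1 == 0 and CmSm == 0, then "" -> similar to a case where the object has not registered any event.
If CmSm-1 > 0 and CmSm > 0, then "Existing".
If CmSm-1 > 0 and CmSm == 0, then "Lost".
If my explanation of the rules is confusing, please let me know so that I can clarify them.
Thanks in advance for any help. Cheers!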
Answer 0 (score: 3)
Why not just use a series of single vectorized conditional steps?
library(dplyr)
Data$Classification <- NA
Data$Classification[Data$Index == 1] <- "New"
Data$Classification[Data$Index > 1 & lag(Data$CmSm) == 0 & Data$CmSm > 0] <- "Returning"
Data$Classification[Data$Index > 1 & lag(Data$CmSm) == 0 & Data$CmSm == 0] <- ""
Data$Classification[Data$Index > 1 & lag(Data$CmSm) > 0 & Data$CmSm > 0] <- "Existing"
Data$Classification[Data$Index > 1 & lag(Data$CmSm) > 0 & Data$CmSm == 0] <- "Lost"
> Data
ID Year CmSm Index Classification
1 1 2014 1 1 New
2 1 2015 2 2 Existing
3 1 2016 1 3 Existing
4 1 2017 0 4 Lost
5 2 2007 1 1 New
6 2 2008 0 2 Lost
7 2 2009 0 3
8 2 2010 1 4 Returning
9 3 2016 1 1 New
10 3 2017 0 2 Lost
This has the benefit of being fast as hell.
Microbenchmark compared with case_when:
Unit: milliseconds
expr min lq mean median uq max neval cld
LAP 1.173902 1.208178 1.580413 1.253404 1.313137 17.07946 100 a
h3rm4n 5.538701 5.732692 7.310704 5.913030 6.138168 50.67234 100 b
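Note that lag() above runs over the whole column rather than per ID; the result is still correct because the Index > 1 guard keeps first rows out of the lagged comparisons. If the previous value should be computed explicitly within each ID, a base R sketch could look like the following. It assumes rows are already ordered by Year within ID; prev and the which() wrappers are choices made for this sketch, not part of the answer above.

```r
# Same toy data as in the question.
Data <- data.frame(
  ID    = c(1,1,1,1,2,2,2,2,3,3),
  Year  = c(2014,2015,2016,2017,2007,2008,2009,2010,2016,2017),
  CmSm  = c(1,2,1,0,1,0,0,1,1,0),
  Index = c(1,2,3,4,1,2,3,4,1,2)
)

# Previous CmSm within each ID; NA on the first row of every ID.
prev <- ave(Data$CmSm, Data$ID, FUN = function(x) c(NA, head(x, -1)))

# Vectorized assignment, one rule per line.
# which() drops the NA positions produced by comparisons against prev.
Data$Classification <- NA_character_
Data$Classification[is.na(prev)]                        <- "New"
Data$Classification[which(prev == 0 & Data$CmSm > 0)]   <- "Returning"
Data$Classification[which(prev == 0 & Data$CmSm == 0)]  <- ""
Data$Classification[which(prev > 0  & Data$CmSm > 0)]   <- "Existing"
Data$Classification[which(prev > 0  & Data$CmSm == 0)]  <- "Lost"

print(Data)
```

Here "New" is derived from the missing previous value instead of from Index, so the sketch does not depend on the Index column being present.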
Answer 1 (score: 2)
library(dplyr)
Data %>%
mutate(Classification = case_when(
Index == 1 ~ "New",
lag(CmSm) == 0 & CmSm > 0 ~ "Returning",
lag(CmSm) > 0 & CmSm > 0 ~ "Existing",
lag(CmSm) > 0 & CmSm == 0 ~ "Lost",
lag(CmSm) == 0 & CmSm == 0 ~ ""
))
ID Year CmSm Index Classification
1 1 2014 1 1 New
2 1 2015 2 2 Existing
3 1 2016 1 3 Existing
4 1 2017 0 4 Lost
5 2 2007 1 1 New
6 2 2008 0 2 Lost
7 2 2009 0 3
8 2 2010 1 4 Returning
9 3 2016 1 1 New
10 3 2017 0 2 Lost
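A note on why the order of the conditions matters here: case_when() uses the first condition that is TRUE, and lag(CmSm) is not grouped by ID, so on the first row of ID 2 and ID 3 it picks up the last CmSm of the previous ID. Putting Index == 1 first prevents those rows from being matched by a lag rule. The sketch below deliberately moves Index == 1 to the end to show the effect; wrong_order is a name introduced for illustration only.

```r
library(dplyr)

# Same toy data as in the question.
Data <- data.frame(
  ID    = c(1,1,1,1,2,2,2,2,3,3),
  Year  = c(2014,2015,2016,2017,2007,2008,2009,2010,2016,2017),
  CmSm  = c(1,2,1,0,1,0,0,1,1,0),
  Index = c(1,2,3,4,1,2,3,4,1,2)
)

# Deliberately wrong ordering: the lag rules are tried before Index == 1.
wrong_order <- Data %>%
  mutate(Classification = case_when(
    lag(CmSm) == 0 & CmSm > 0  ~ "Returning",
    lag(CmSm) >  0 & CmSm > 0  ~ "Existing",
    lag(CmSm) >  0 & CmSm == 0 ~ "Lost",
    lag(CmSm) == 0 & CmSm == 0 ~ "",
    Index == 1                 ~ "New"
  ))

# Classification of the first row of each ID under the wrong ordering.
print(wrong_order$Classification[Data$Index == 1])
```

Only the very first row still comes out as "New", because its lagged value is NA and an NA condition is treated as no match. Grouping by ID before mutate() is another way to avoid the carry-over.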
Answer 2 (score: 2)
This is a good use case for case_when from dplyr:
Data %>%
group_by(ID) %>%
mutate(Status = case_when(Index == 1 ~ "New",
lag(CmSm) == 0 & CmSm > 0 ~ "Returning",
lag(CmSm) == 0 & CmSm == 0 ~ "",
lag(CmSm) > 0 & CmSm > 0 ~ "Existing",
lag(CmSm) > 0 & CmSm == 0 ~ "Lost")
)
Result:
# A tibble: 10 x 5
# Groups: ID [3]
ID Year CmSm Index Status
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2014 1 1 New
2 1 2015 2 2 Existing
3 1 2016 1 3 Existing
4 1 2017 0 4 Lost
5 2 2007 1 1 New
6 2 2008 0 2 Lost
7 2 2009 0 3 ""
8 2 2010 1 4 Returning
9 3 2016 1 1 New
10 3 2017 0 2 Lost
Answer 3 (score: 2)
Here is another base R solution. It also uses logical indexing, like @LAP's solution.
I will recreate Dataout, because the column of interest is a factor.
Dataout <- data.frame(
ID = c(1,1,1,1,2,2,2,2,3,3),
Year = c(2014,2015,2016,2017,2007,2008, 2009,2010,2016,2017),
CmSm = c(1,2,1,0,1,0,0,1,1,0),
Index = c(1,2,3,4,1,2,3,4,1,2),
Cassification = c("New", "Existing", "Existing", "Lost", "New", "Lost","","Returning", "New","Lost"),
stringsAsFactors = FALSE
)
inx <- Data$Index == 1
inxCmSm <- Data$CmSm == 0
inxCmSm1 <- c(FALSE, inxCmSm[-length(inxCmSm)])
Data$Status <- ""
Data$Status[inx] <- "New"
Data$Status[!inx & inxCmSm1 & !inxCmSm] <- "Returning"
Data$Status[!inx & !inxCmSm1 & !inxCmSm] <- "Existing"
Data$Status[!inx & !inxCmSm1 & inxCmSm] <- "Lost"
identical(Data$Status, Dataout$Cassification)
#[1] TRUE
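On why stringsAsFactors = FALSE matters above: before R 4.0.0, data.frame() converted character columns to factors by default, and identical() treats a factor and a character vector as different even when the labels match. A small sketch with made-up vectors (chr and fct are names introduced here):

```r
# A character vector and the factor built from it.
chr <- c("New", "Lost", "", "Returning")
fct <- factor(chr)

# Different types, so identical() is FALSE even though the labels agree.
same_as_factor <- identical(chr, fct)

# Converting the factor back to character makes the comparison succeed.
same_as_character <- identical(chr, as.character(fct))

print(same_as_factor)
print(same_as_character)
```

Since R 4.0.0 the default is stringsAsFactors = FALSE, so the explicit argument mainly matters on older installations.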