I am new to R and am trying to delete rows based on the values of earlier rows. Sample data:
Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | 0
500219 | 2016-04-14 11:00:00 | 1
500219 | 2016-04-15 12:00:00 | 1
500219 | 2016-05-23 09:00:00 | 0
500219 | 2016-05-02 19:00:00 | 0
500220 | 2016-04-11 12:00:00 | 0
500220 | 2016-04-14 11:00:00 | 1
500220 | 2016-04-15 12:00:00 | 1
500220 | 2016-05-23 09:00:00 | 0
500220 | 2016-05-02 19:00:00 | 0
For each Cust_ID, I want to keep only the rows up to and including those where Value = 1, which gives this result:
Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | 0
500219 | 2016-04-14 11:00:00 | 1
500219 | 2016-04-15 12:00:00 | 1
500220 | 2016-04-11 12:00:00 | 0
500220 | 2016-04-14 11:00:00 | 1
500220 | 2016-04-15 12:00:00 | 1
Any help would be greatly appreciated!
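For anyone reproducing this, the sample table above can be built as a data frame roughly like this. The name df and the column types (integer IDs and values, dates kept as character strings) are assumptions, not something stated in the question.

```r
# Sample data from the question.
# Assumption: Date is stored as character and Cust_ID/Value as integer.
df <- data.frame(
  Cust_ID = rep(c(500219L, 500220L), times = c(6, 5)),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00"),
  Value = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)
```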
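Answer 0 (score: 2)
Here is a split-apply-combine approach that keeps, for each customer, the rows before the first 1 together with every row whose Value is 1.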
# split data by customer ID
myList <- split(df, df$Cust_ID)
# loop through ID list, drop desired rows, rbind resulting list
dfNew <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  i[c(1:drop[1], drop[-1]), ]
}))
This returns
dfNew
Cust_ID Date Value
500219.1 500219 2016-04-11 12:00:00 0
500219.2 500219 2016-04-12 16:00:00 0
500219.3 500219 2016-04-14 11:00:00 1
500219.4 500219 2016-04-15 12:00:00 1
500220.7 500220 2016-04-11 12:00:00 0
500220.8 500220 2016-04-14 11:00:00 1
500220.9 500220 2016-04-15 12:00:00 1
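Note that this solution fails if a customer ID has no Value equal to 1.
If you want to keep the observations for customers that never reach the threshold of 1, use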
dfNew <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  if (length(drop) != 0) i[c(1:drop[1], drop[-1]), ]
  else i
}))
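To see why the unguarded version breaks, here is a small sketch on a made-up vector (v is hypothetical, standing in for one customer's Value column). When nothing equals 1, which() returns an empty vector, so drop[1] is NA and the range 1:drop[1] raises an error.

```r
v <- c(0, 0, 0)          # a customer whose Value never reaches 1
drop <- which(v == 1)    # integer(0): no matching positions
first <- drop[1]         # NA, because there is no first element

# 1:NA is an error; capture the message instead of stopping the script.
res <- tryCatch(1:first, error = function(e) conditionMessage(e))
print(res)
```

The length(drop) != 0 check in the second version avoids this by returning the group untouched.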
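Answer 1 (score: 2)
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'Cust_ID', we build the sequence from 1 up to the max index where 'Value' is 1, look up the corresponding row indices (.I), and use them to subset the rows of the data.table. Groups with no 1 are kept in full.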
library(data.table)
setDT(df1)[df1[, if(any(Value == 1)) .I[seq(max(which(Value == 1)))]
else .I[1:.N] , by = Cust_ID]$V1]
# Cust_ID Date Value
#1: 500219 2016-04-11 12:00:00 0
#2: 500219 2016-04-12 16:00:00 0
#3: 500219 2016-04-14 11:00:00 1
#4: 500219 2016-04-15 12:00:00 1
#5: 500220 2016-04-11 12:00:00 0
#6: 500220 2016-04-14 11:00:00 1
#7: 500220 2016-04-15 12:00:00 1
Or use dplyr:
library(dplyr)
df1 %>%
group_by(Cust_ID) %>%
slice(if(any(Value==1)) seq(max(which(Value==1))) else row_number())
# Cust_ID Date Value
# <int> <chr> <int>
#1 500219 2016-04-11 12:00:00 0
#2 500219 2016-04-12 16:00:00 0
#3 500219 2016-04-14 11:00:00 1
#4 500219 2016-04-15 12:00:00 1
#5 500220 2016-04-11 12:00:00 0
#6 500220 2016-04-14 11:00:00 1
#7 500220 2016-04-15 12:00:00 1
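The two answers agree on the sample data, but they may differ when a 0 sits between two 1s. A minimal base R sketch of the two index rules, using a made-up vector v for a single customer (not data from the question):

```r
v <- c(0, 1, 0, 1, 0)                     # hypothetical Value column for one customer

drop <- which(v == 1)                     # positions of the 1s
idx_answer0 <- c(1:drop[1], drop[-1])     # rule from answer 0: up to first 1, plus later 1s
idx_answer1 <- seq(max(which(v == 1)))    # rule from answer 1: everything up to the last 1

print(idx_answer0)   # skips the 0 at position 3
print(idx_answer1)   # keeps the 0 at position 3
```

Which rule is wanted depends on how "rows before Value = 1" should treat such gaps; the question's sample does not contain one.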
Answer 2 (score: 0)
A loop-based approach:
cust <- 0
keep <- FALSE
keepers <- vector(mode = "logical", length = nrow(df))
## walk through the dataframe backwards
for (rec in nrow(df):1) {
  ## have we been working with this customer?
  if (df[rec, ]$Cust_ID == cust) {
    if (df[rec, ]$Value == 1 || keep) {
      keepers[rec] <- TRUE
      keep <- TRUE
    }
  } else {
    ## new customer: keep this row only if it is a 1
    cust <- df[rec, ]$Cust_ID
    if (df[rec, ]$Value == 1) {
      keepers[rec] <- TRUE
      keep <- TRUE
    } else {
      keep <- FALSE
    }
  }
}
df <- df[keepers, ]
df
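The backward walk keeps a row whenever a 1 occurs at or after it within the same customer. Assuming that reading is right, the same rule can be sketched without an explicit loop using base R's ave() and a reversed cumulative sum. The small data frame below is made up for illustration. Like the loop, this drops customers that never reach 1.

```r
# Hypothetical data: customer 2 never reaches 1.
df <- data.frame(
  Cust_ID = c(1, 1, 1, 1, 2, 2, 3, 3),
  Value   = c(0, 1, 1, 0, 0, 0, 1, 0)
)

# For each row, count the 1s from that row to the end of its customer group.
ones_remaining <- ave(as.integer(df$Value == 1), df$Cust_ID,
                      FUN = function(x) rev(cumsum(rev(x))))

# Keep rows that still have a 1 at or after them.
result <- df[ones_remaining > 0, ]
print(result)
```

Unlike the loop, ave() groups by Cust_ID regardless of row order, so it does not rely on each customer's rows being contiguous.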