R: Remove rows based on values in previous rows

Time: 2016-07-29 15:29:46

Tags: r

I'm new to R and am trying to remove rows based on the values in previous rows. Sample data:

Cust_ID | Date                 | Value
500219  | 2016-04-11 12:00:00  | 0
500219  | 2016-04-12 16:00:00  | 0
500219  | 2016-04-14 11:00:00  | 1
500219  | 2016-04-15 12:00:00  | 1
500219  | 2016-05-23 09:00:00  | 0
500219  | 2016-05-02 19:00:00  | 0
500220  | 2016-04-11 12:00:00  | 0
500220  | 2016-04-14 11:00:00  | 1
500220  | 2016-04-15 12:00:00  | 1
500220  | 2016-05-23 09:00:00  | 0
500220  | 2016-05-02 19:00:00  | 0
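
To make the example reproducible, the table above can be entered as a data frame roughly as follows. This is only a sketch: the name `df` and the choice to store Date as a character column are assumptions, not something stated in the question.

```r
# Sample data from the table above. The name `df` is an assumption.
# Date is kept as character here (also an assumption); convert it with
# as.POSIXct() if date arithmetic is needed.
df <- data.frame(
  Cust_ID = rep(c(500219L, 500220L), times = c(6, 5)),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00"),
  Value = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)
```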

For each Cust_ID, I want to keep only the rows up to and including those where Value = 1, which gives this result:

Cust_ID | Date                 | Value
500219  | 2016-04-11 12:00:00  | 0
500219  | 2016-04-12 16:00:00  | 0
500219  | 2016-04-14 11:00:00  | 1
500219  | 2016-04-15 12:00:00  | 1
500220  | 2016-04-11 12:00:00  | 0
500220  | 2016-04-14 11:00:00  | 1
500220  | 2016-04-15 12:00:00  | 1

Any help would be greatly appreciated!

3 answers:

Answer 0: (score: 2)

Here is a split-apply-combine approach that keeps, for each customer, the 1s and the values before the first 1.

# split data by customer ID
myList <- split(df, df$Cust_ID)
# loop through ID list, drop desired rows, rbind resulting list
dfNew <- do.call(rbind, lapply(myList, function(i) {
                               drop <- which(i$Value==1)
                               i[c(1:drop[1], drop[-1]),]}))

which returns

dfNew
         Cust_ID                   Date Value
500219.1  500219  2016-04-11 12:00:00       0
500219.2  500219  2016-04-12 16:00:00       0
500219.3  500219  2016-04-14 11:00:00       1
500219.4  500219  2016-04-15 12:00:00       1
500220.7  500220  2016-04-11 12:00:00       0
500220.8  500220  2016-04-14 11:00:00       1
500220.9  500220  2016-04-15 12:00:00       1

Note that this solution will fail for any customer ID that has no Value equal to 1.

If you want to keep the observations for customers that never reach the threshold of 1, use

dfNew <- do.call(rbind, lapply(myList, function(i) {
                               drop <- which(i$Value==1)
                               if(length(drop) != 0) i[c(1:drop[1], drop[-1]),]
                               else i}))
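
A small check of the difference between the two versions, as a sketch on made-up data: customer 500221 below is hypothetical and never has a Value of 1. The unguarded version should fail on that group because `drop[1]` is `NA`, while the guarded version returns the group unchanged.

```r
# Hypothetical data: 500221 never reaches Value == 1
df <- data.frame(Cust_ID = c(500219, 500219, 500219, 500221, 500221),
                 Value   = c(0, 1, 0, 0, 0))
myList <- split(df, df$Cust_ID)

# Unguarded version: which() returns integer(0) for 500221,
# so 1:drop[1] becomes 1:NA and raises an error
unguarded <- tryCatch(
  do.call(rbind, lapply(myList, function(i) {
    drop <- which(i$Value == 1)
    i[c(1:drop[1], drop[-1]), ]
  })),
  error = function(e) "error"
)

# Guarded version: groups without a 1 are returned as they are
guarded <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  if (length(drop) != 0) i[c(1:drop[1], drop[-1]), ] else i
}))

unguarded       # "error"
guarded$Value   # 0 1 0 0
```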

Answer 1: (score: 2)

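We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'Cust_ID', we take the sequence up to the max index where 'Value' is 1, get the corresponding row index (.I), and use it to subset the rows of the data.table. For a customer with no 1, all of its rows are kept.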

library(data.table)
setDT(df1)[df1[,  if(any(Value == 1)) .I[seq(max(which(Value == 1)))]
                                 else .I[1:.N] , by = Cust_ID]$V1]
#      Cust_ID                Date Value
#1:  500219 2016-04-11 12:00:00     0
#2:  500219 2016-04-12 16:00:00     0
#3:  500219 2016-04-14 11:00:00     1
#4:  500219 2016-04-15 12:00:00     1
#5:  500220 2016-04-11 12:00:00     0
#6:  500220 2016-04-14 11:00:00     1
#7:  500220 2016-04-15 12:00:00     1
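
To see what the grouped index expression does, here it is applied to a single customer's Value column as a plain vector, outside data.table. This is only an illustration; `last_one_index` is a hypothetical helper name and is not part of data.table.

```r
# Hypothetical helper: positions to keep within one group
last_one_index <- function(Value) {
  if (any(Value == 1)) {
    seq(max(which(Value == 1)))   # 1 .. position of the last 1
  } else {
    seq_along(Value)              # no 1 present: keep every position
  }
}

last_one_index(c(0, 0, 1, 1, 0, 0))   # 1 2 3 4
last_one_index(c(0, 0, 0))            # 1 2 3
```

In the data.table call, wrapping these within-group positions in `.I[...]` turns them into row numbers of the whole table, which is what the outer subset uses.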

Or use a similar approach with dplyr:

library(dplyr)
df1 %>% 
     group_by(Cust_ID) %>% 
     slice(if(any(Value==1)) seq(max(which(Value==1))) else row_number())
#   Cust_ID                Date Value
#     <int>               <chr> <int>
#1  500219 2016-04-11 12:00:00     0
#2  500219 2016-04-12 16:00:00     0
#3  500219 2016-04-14 11:00:00     1
#4  500219 2016-04-15 12:00:00     1
#5  500220 2016-04-11 12:00:00     0
#6  500220 2016-04-14 11:00:00     1
#7  500220 2016-04-15 12:00:00     1
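
One detail that may be worth checking against your own data: the index rule used here (everything up to the last 1) and the rule in the split/lapply answer (everything up to the first 1, plus any later 1s) agree on the sample data, but they differ when a 0 sits between two 1s. A quick illustration on a made-up vector:

```r
Value <- c(0, 1, 0, 1, 0)   # made-up values with a 0 between two 1s

ones <- which(Value == 1)

# Rule in the split/lapply answer: rows up to the first 1, plus later 1s
first_rule <- c(1:ones[1], ones[-1])

# Rule in the data.table/dplyr code: rows up to the last 1
last_rule <- seq(max(ones))

first_rule   # 1 2 4
last_rule    # 1 2 3 4
```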

Answer 2: (score: 0)

A loop-based approach (it assumes the rows of df are already grouped by Cust_ID):

cust <- 0
keep <- FALSE
keepers <- vector(mode = "logical", length = nrow(df))

## walk through the data frame backwards
for (rec in nrow(df):1)
{
  ## are we still on the same customer?
  if (df$Cust_ID[rec] == cust)
  {
    ## keep the row if it is a 1, or if a later row of this customer was a 1
    if (df$Value[rec] == 1 || keep)
    {
      keepers[rec] <- TRUE
      keep <- TRUE
    }
  }
  else
  {
    ## new customer: remember the ID and reset the keep flag
    cust <- df$Cust_ID[rec]
    if (df$Value[rec] == 1)
    {
      keepers[rec] <- TRUE
      keep <- TRUE
    }
    else
    {
      keep <- FALSE
    }
  }
}

df <- df[keepers,]
df
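
The backward walk with a keep flag amounts to asking, for each row, whether a 1 occurs at that row or later within the same customer. As a sketch of the same idea without an explicit loop (a swapped-in base R technique using `ave()`, not this answer's own code), on a small made-up data frame:

```r
# Made-up data; 500221 is a hypothetical customer with no 1 at all
df <- data.frame(Cust_ID = c(500219, 500219, 500219, 500220, 500220, 500221),
                 Value   = c(0, 1, 0, 1, 0, 0))

# Within each customer: reverse the 0/1 flags, take a running maximum,
# and reverse back. The result is 1 for rows at or before the last 1.
flag <- ave(as.integer(df$Value == 1), df$Cust_ID,
            FUN = function(v) rev(cummax(rev(v))))

result <- df[flag == 1, ]
result
```

Like the loop, this drops customers that never have a 1. Unlike the loop, it does not need the rows of a customer to be next to each other, since `ave()` groups by Cust_ID itself.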