Question

I have a tricky problem to solve.

Suppose I have this dataset:

Id Name Price sales Profit Month Category Mode Supplier
1    A     0     0      0     1        X    K     John
1    A     0     0      0     2        X    K     John
1    A     0     0      0     3        X    K     John
1    A     2     5      0     4        X    L      Sam
1    A     2     3      4     5        X    L      Sam
1    A     0     0      0     6        X    L      Sam
2    C     2     4      9     1        X    M     John
2    C     0     0      0     2        X    L     John
2    C     0     0      0     3        X    K     John
2    C     2     8      0     4        Y    M     John
2    C     2     8     10     5        Y    K     John
2    C     0     0      0     6        Y    K     John
3    E     0     0      0     1        Y    M      Sam
3    E     0     0      0     2        Y    L      Sam
3    E     2     5      9     3        Y    M      Sam
3    E     0     0      0     4        Z    M     Kyle
3    E     0     0      0     5        Z    L     Kyle
3    E     0     0      0     6        Z    M     Kyle

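Now I want to remove from the data frame the rows, per Id, where Price, sales and Profit are all zero for three consecutive months. In other words, how can I delete rows only within certain groups, grouped by Id?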

Expected output:

Id Name Price sales Profit Month Category Mode Supplier
1    A     2     5      0     4        X    L      Sam
1    A     2     3      4     5        X    L      Sam
1    A     0     0      0     6        X    L      Sam
2    C     2     4      9     1        X    M     John
2    C     0     0      0     2        X    L     John
2    C     0     0      0     3        X    K     John
2    C     2     8      0     4        Y    M     John
2    C     2     8     10     5        Y    K     John
2    C     0     0      0     6        Y    K     John
3    E     0     0      0     1        Y    M      Sam
3    E     0     0      0     2        Y    L      Sam
3    E     2     5      9     3        Y    M      Sam

This is just a reproducible sample; my original data has more than 800k rows, so I am looking for something that can do this on a large dataset.

I tried the approach that was suggested earlier:

library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(Price == 0 & sales == 0 & Profit == 0))][
    !(Price==0 & sales == 0 & Profit == 0 & N >= 2)]

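When I run this I get the error 'could not find rleid function', even though I have the data.table package installed.

P.S. I have asked this question before, but the few solutions in the other posts only worked on small data and none of them solved the problem on a large dataset, which is why I am asking again.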

Answer 1

It is fairly "homemade", but maybe it will help (my example is a bit simpler, but the idea is the same):

library("dplyr")

# just an example:

month <- rep(1:7, 3)
id <- rep(c("A", "C", "E"), each=7)
price <- c(0,0,0,2,2,0,2,0,0,2,2,0,0,0,2,0,0,0, 1, 1, 1)
sales <- c(0,0,0,4,3,0,2,0,0,1,3,0,0,0,3,0,0,0, 1, 1, 1)
supplier <- rep(c("john", "anna", "ben"), 7)

data.frame(id, price, sales, month, supplier) -> dane

# dplyr's lag() shifts a vector down by one position, so the first element becomes NA:

lag1_sales <- lag(dane$sales)
lag2_sales <- lag(dane$sales, 2) # the same, but shifted by two positions (two leading NAs)

lag1_price <- lag(dane$price)
lag2_price <- lag(dane$price, 2)

# Add the lagged vectors to the data frame as columns:

dane <- cbind(dane, lag1_sales, lag2_sales, lag1_price, lag2_price)

# mutate creates a new column that is 1 when sales and price and their two lags are all equal to 0, so the marker shows where a run of three zero rows ends:

dane %>% 
    mutate(marker=ifelse(sales==0 & price==0 & 
                             lag1_sales==0 & lag2_sales==0 &
                             lag1_price==0 & lag2_price==0, 1, 0)) -> dane

# marker2 and marker3 shift the marker up by one and two rows, so the two rows before each marked row are flagged as well:

marker2 <- c(dane$marker[-1], NA)
marker3 <- c(dane$marker[-c(1, 2)], NA, NA)

dane <- cbind(dane, marker2, marker3)

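# Keep only the rows that are not marked. Wrapping the condition in
# %in% TRUE treats NA (from the lag/shift padding) as "not marked";
# otherwise filter() would silently drop those rows:

dane %>% 
    filter(!((marker==1 | marker2==1 | marker3==1) %in% TRUE)) -> new_data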
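
A more compact variant of the same idea, as a rough sketch rather than a tested solution for the 800k-row data: instead of building lag columns by hand, compute the length of each run of all-zero rows within every id and drop the runs of three or more. The helper `run_length()` and the names `sample_df` and `kept` are made up for this illustration, and the sketch assumes the rows are already sorted by month within each id.

```r
library(dplyr)

# Hypothetical helper (not from any package): for every element of x,
# return the length of the run of equal values it belongs to.
run_length <- function(x) {
  r <- rle(x)
  rep(r$lengths, times = r$lengths)
}

# Same toy data as above, rebuilt so the sketch is self-contained
sample_df <- data.frame(
  id    = rep(c("A", "C", "E"), each = 7),
  month = rep(1:7, 3),
  price = c(0,0,0,2,2,0,2, 0,0,2,2,0,0,0, 2,0,0,0,1,1,1),
  sales = c(0,0,0,4,3,0,2, 0,0,1,3,0,0,0, 3,0,0,0,1,1,1)
)

kept <- sample_df %>%
  group_by(id) %>%                                   # runs never cross ids
  mutate(all_zero = price == 0 & sales == 0,         # row has only zeros
         run_len  = run_length(all_zero)) %>%        # length of its run
  ungroup() %>%
  filter(!(all_zero & run_len >= 3)) %>%             # drop zero runs of 3+
  select(-all_zero, -run_len)
```

Grouping by id means a run of zeros that ends one id and starts the next is never counted as one run, which the lag-based version above does not guard against.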

Answer 2

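Here is my answer. Note that this code also removes the rows when the three rows are not consecutive months (e.g. months 2, 5, 6), because it only looks at adjacent rows.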

#Generate data
month <- rep(1:7, 3)
id <- rep(c("1", "2", "3"), each=7)
price <- c(0,0,0,2,2,0,2,0,0,2,2,0,0,0,2,0,0,0, 1, 1, 1)
sales <- c(0,0,0,4,3,0,2,0,0,1,3,0,0,0,3,0,0,0, 1, 1, 1)
test <- data.frame(id, price, sales, month)

#Calculate how many consecutive times a combination of id, 
#price & sales is encountered (the separator keeps the three parts distinct)
sequence <- rle(paste(test$id, test$price, test$sales, sep="_"))

#Flag the runs whose price and sales are both zero,
#looking at the first row of each run
run_start <- cumsum(c(1, head(sequence$lengths, -1)))
zero_run <- test$price[run_start] == 0 & test$sales[run_start] == 0

#Calculate the rows to keep: drop only zero runs of three or more rows.
#rep() expands the per-run flag to one logical value per row
index <- !(zero_run & sequence$lengths >= 3)
index2 <- rep(index, times = sequence$lengths)

#store results:
test2 <- test[index2,]
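
The same run-length idea can be applied to the columns from the question using only base R, which also avoids the rleid dependency. This is a minimal sketch, not benchmarked on 800k rows: the data frame below retypes only the numeric columns of the sample, the names `all_zero`, `run_len` and `result` are chosen for this illustration, and "three consecutive months" is read as "three or more".

```r
# Numeric columns of the sample data from the question
mydf <- data.frame(
  Id     = rep(1:3, each = 6),
  Month  = rep(1:6, 3),
  Price  = c(0,0,0,2,2,0, 2,0,0,2,2,0,  0,0,2,0,0,0),
  sales  = c(0,0,0,5,3,0, 4,0,0,8,8,0,  0,0,5,0,0,0),
  Profit = c(0,0,0,0,4,0, 9,0,0,0,10,0, 0,0,9,0,0,0)
)

# Runs are only meaningful if months are in order within each Id
mydf <- mydf[order(mydf$Id, mydf$Month), ]

# TRUE where all three measures are zero
all_zero <- mydf$Price == 0 & mydf$sales == 0 & mydf$Profit == 0

# ave() applies the function separately within each Id and returns,
# for every row, the length of the run that row belongs to
run_len <- ave(as.integer(all_zero), mydf$Id, FUN = function(z) {
  r <- rle(z)
  rep(r$lengths, times = r$lengths)
})

# Drop rows that sit in an all-zero run of three or more months
result <- mydf[!(all_zero & run_len >= 3), ]
```

On the sample this keeps months 4 to 6 for Id 1, all six months for Id 2 and months 1 to 3 for Id 3, which matches the expected output in the question.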
