I have a complex problem to solve.
Suppose I have this dataset:
Id Name Price sales Profit Month Category Mode Supplier
1 A 0 0 0 1 X K John
1 A 0 0 0 2 X K John
1 A 0 0 0 3 X K John
1 A 2 5 0 4 X L Sam
1 A 2 3 4 5 X L Sam
1 A 0 0 0 6 X L Sam
2 C 2 4 9 1 X M John
2 C 0 0 0 2 X L John
2 C 0 0 0 3 X K John
2 C 2 8 0 4 Y M John
2 C 2 8 10 5 Y K John
2 C 0 0 0 6 Y K John
3 E 0 0 0 1 Y M Sam
3 E 0 0 0 2 Y L Sam
3 E 2 5 9 3 Y M Sam
3 E 0 0 0 4 Z M Kyle
3 E 0 0 0 5 Z L Kyle
3 E 0 0 0 6 Z M Kyle
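Now I want to remove from the data frame, for each Id, the rows where Price, sales and Profit are all zero for three consecutive months. How can I do this by Id?
Expected output: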
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 0 4 X L Sam
1 A 2 3 4 5 X L Sam
1 A 0 0 0 6 X L Sam
2 C 2 4 9 1 X M John
2 C 0 0 0 2 X L John
2 C 0 0 0 3 X K John
2 C 2 8 0 4 Y M John
2 C 2 8 10 5 Y K John
2 C 0 0 0 6 Y K John
3 E 0 0 0 1 Y M Sam
3 E 0 0 0 2 Y L Sam
3 E 2 5 9 3 Y M Sam
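This is just a reproducible sample; my original data has more than 800k rows, so I am looking for something that can do this on a large dataset.
I tried the approach suggested earlier: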
library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(Price == 0 & sales == 0 & Profit == 0))][
!(Price==0 & sales == 0 & Profit == 0 & N >= 2)]
When I run it I get the error 'could not find rleid function', even though I have the data.table package installed.
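One possible cause, offered as an assumption rather than a diagnosis: rleid() is only exported by newer releases of data.table, so an outdated installed version would produce this kind of error. The sketch below checks what the installed package exports and defines a hypothetical base-R stand-in, rleid_base (not part of data.table), that numbers the runs of a single vector. Separately, note that N >= 2 in the attempt above would also drop runs of two zero rows, which the expected output keeps.

```r
# Check whether the installed data.table exports rleid()
# (assumption: an outdated version is what triggers the error).
if (requireNamespace("data.table", quietly = TRUE)) {
  print(packageVersion("data.table"))
  print("rleid" %in% getNamespaceExports("data.table"))
}

# Hypothetical base-R stand-in for rleid() on one vector without NAs:
# the id increases by one every time the value changes.
rleid_base <- function(x) {
  if (length(x) == 0L) return(integer(0))
  cumsum(c(TRUE, x[-1L] != x[-length(x)]))
}

run_ids <- rleid_base(c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE))
print(run_ids)
```

If the export check prints FALSE, updating data.table would be the first thing to try.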
Answer 0 (score: 0)
It is fairly "homemade", but maybe it will help (my example is a bit simpler, but the idea is the same):
library("dplyr")
# just an example:
month <- rep(1:7, 3)
id <- rep(c("A", "C", "E"), each=7)
price <- c(0,0,0,2,2,0,2,0,0,2,2,0,0,0,2,0,0,0, 1, 1, 1)
sales <- c(0,0,0,4,3,0,2,0,0,1,3,0,0,0,3,0,0,0, 1, 1, 1)
supplier <- rep(c("john", "anna", "ben"), 7)
data.frame(id, price, sales, month, supplier) -> dane
# lag() shifts a vector down by one position, so the first element becomes NA:
lag1_sales <- lag(dane$sales)
lag2_sales <- lag(dane$sales, 2) # the same, but shifted by two positions
lag1_price <- lag(dane$price)
lag2_price <- lag(dane$price, 2)
# I add them to the data frame as columns:
dane <- cbind(dane, lag1_sales, lag2_sales, lag1_price, lag2_price)
# mutate creates a new column that is 1 if sales and price and both of their lags are equal to 0, so that I have a marker on the third row of a run of three zeros:
dane %>%
  mutate(marker=ifelse(sales==0 & price==0 &
                       lag1_sales==0 & lag2_sales==0 &
                       lag1_price==0 & lag2_price==0, 1, 0)) -> dane
# marker2 and marker3 flag the two rows preceding each marked row, so all three rows of the run are flagged:
marker2 <- c(dane$marker[-1], NA)
marker3 <- c(dane$marker[-c(1, 2)], NA, NA)
dane <- cbind(dane, marker2, marker3)
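# I keep only the rows that are not marked; %in% treats NA markers as "not marked",
# so rows at the start and end of the data are not dropped by accident:
dane %>%
  filter(!(marker %in% 1 | marker2 %in% 1 | marker3 %in% 1)) -> new_data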
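The lags above are computed over the whole column, so a run of zeros could span the end of one id and the start of the next. A minimal sketch of the same windowing idea done per id follows, assuming the rows can be ordered by month within each id; the helper columns zero, end3 and drop are names introduced here for illustration only.

```r
library(dplyr)

# Same toy data as in the answer above.
dane <- data.frame(
  id    = rep(c("A", "C", "E"), each = 7),
  month = rep(1:7, 3),
  price = c(0,0,0,2,2,0,2, 0,0,2,2,0,0,0, 2,0,0,0,1,1,1),
  sales = c(0,0,0,4,3,0,2, 0,0,1,3,0,0,0, 3,0,0,0,1,1,1)
)

new_data <- dane %>%
  arrange(id, month) %>%
  group_by(id) %>%
  mutate(
    zero = price == 0 & sales == 0,
    # TRUE on the last row of any window of three zero rows within this id
    end3 = zero & lag(zero, 1, default = FALSE) & lag(zero, 2, default = FALSE),
    # spread the flag back over the two rows that precede the window end
    drop = end3 | lead(end3, 1, default = FALSE) | lead(end3, 2, default = FALSE)
  ) %>%
  ungroup() %>%
  filter(!drop) %>%
  select(-zero, -end3, -drop)

print(new_data)
```

Because lag() and lead() are given default = FALSE, the first and last rows of each id never produce NA flags, and runs longer than three rows are removed as well.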
Answer 1 (score: 0)
Here is my answer. Even if three consecutive months (e.g. months: 2,5,6
#Generate data
month <- rep(1:7, 3)
id <- rep(c("1", "2", "3"), each=7)
price <- c(0,0,0,2,2,0,2,0,0,2,2,0,0,0,2,0,0,0, 1, 1, 1)
sales <- c(0,0,0,4,3,0,2,0,0,1,3,0,0,0,3,0,0,0, 1, 1, 1)
test <- data.frame(id, price, sales, month)
#Calculate how many consecutive times a combination of id,
#price & sales is encountered
sequence <- rle(paste(test$id,test$price,test$sales,sep=""))
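#calculate the row indexes to keep; note that lengths != 3 drops every run of
#exactly three identical rows, including non-zero ones
index <- with(sequence, lengths != 3 )
#expand to one logical value per row (a numeric 0/1 index would select row 1 repeatedly)
index2 <- rep(index, times=sequence$lengths)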
#store results:
test2 <- test[index2,]
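A variant of the rle() idea that only targets zero rows and never lets a run span two ids could look like the sketch below. It uses base R only and assumes the rows are already ordered by month within each id; in_zero_run is a hypothetical helper introduced here, and runs of three or more zero rows are all treated as removable.

```r
# Same toy data as in the answer above.
test <- data.frame(
  id    = rep(c("1", "2", "3"), each = 7),
  month = rep(1:7, 3),
  price = c(0,0,0,2,2,0,2, 0,0,2,2,0,0,0, 2,0,0,0,1,1,1),
  sales = c(0,0,0,4,3,0,2, 0,0,1,3,0,0,0, 3,0,0,0,1,1,1)
)

# Hypothetical helper: TRUE for rows that sit inside a run of at least
# min_run consecutive TRUE values of `zero`.
in_zero_run <- function(zero, min_run = 3) {
  r <- rle(zero)
  rep(r$values & r$lengths >= min_run, times = r$lengths)
}

zero <- test$price == 0 & test$sales == 0

# ave() applies the helper separately within each id.
drop <- as.logical(ave(as.integer(zero), test$id,
                       FUN = function(z) in_zero_run(z == 1L)))

test2 <- test[!drop, ]
print(test2)
```

Since rle() and ave() are vectorised, this should scale reasonably to a few hundred thousand rows, although that has not been measured here.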