我试图根据值的出现来获取数据帧的子集。这在下面给出的一个例子中得到了最好的解释。这个问题与Selecting top finite number of rows for each unique value of a column in a data fame in R有很高的关系 但是,我想改变head()命令选择的项目数。
#Sample data
input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
colnames(input) <- c( "Product" , "Something" ,"Date")
input <- as.data.frame(input)
input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
#Sort based on date, I want to leave out the entries with the oldest dates.
input <- input[ with( input, order(Date)), ]
#Create number of items I want to select
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling( table_input$Freq*0.25 )
#This next part is a very time consuming method (Have 2 mln rows, 90k different products)
first <- TRUE
for( i in table_input$Var1 ) {
data_selected <- input[input$Product == i,]
number <- table_input[table_input$Var1 == i ,]$twentyfive
head <- head( data_selected, number)
if( first == FALSE) {
output <- rbind(output, head)
} else {
output <- head
}
first <- FALSE
}
希望有人知道更好,更有效的方式。我尝试使用此处答案中的split函数:Selecting top finite number of rows for each unique value of a column in a data fame in R来拆分产品,然后尝试迭代它们并选择head()。但是,split函数总是耗尽内存(无法分配..)
input_split <- split(input, input$Product) #Works here, but not i my problem.
所以最后我的问题是我想要选择不同数量的每个独特产品。所以这里有2件来自1000001,1件来自1000002和1000003。
答案 0 :(得分:10)
有两种解决方案可供选择。 plyr::ddply
是专为您的需求而设计的,但使用data.table
的速度会更快。
您希望将data.frame
拆分为块,删除按日期排序的每个块的最后25%的行,然后重新组合成data.frame
。这可以通过一个简单的线来完成......
require( plyr )
ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
# Product Something Date
#1 1000001 100005 2011-01-01
#2 1000001 100002 2011-01-02
#3 1000001 100006 2011-01-02
#4 1000001 100004 2011-01-04
#5 1000002 100007 2011-01-01
#6 1000002 100003 2011-01-04
#7 1000003 100002 2011-01-02
#8 1000003 100008 2011-01-04
data.table
解决方案对于data.table
,您需要r-forge
的最新开发版本(由于我们在CRAN版本的data.table中有负下标not being implemented)。请务必按照install.package
调用以获取最新版本...
install.packages( "data.table" , repos="http://r-forge.r-project.org" )
require( data.table )
DT <- data.table( input )
# Sort by Product then Date very quickly
setkeyv( DT , c( "Product" , "Date" ) )
# Return the bottom 75% of rows (i.e. not the earliest)
DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]
# Product Something Date
#1: 1000001 100005 2011-01-01
#2: 1000001 100002 2011-01-02
#3: 1000001 100006 2011-01-02
#4: 1000001 100004 2011-01-04
#5: 1000002 100007 2011-01-01
#6: 1000002 100003 2011-01-04
#7: 1000003 100002 2011-01-02
#8: 1000003 100008 2011-01-04
data.table
您可以更轻松地执行此操作(因此您不需要data.table
的开发版本)...
DT[ , .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ]
你还可以在lapply
参数中使用j
(我担心我使用.SD
),并且在data.table
的{{1}}上运行约14秒2e6行,90,000个产品(组)......
set.seed(1)
Product <- sample( 1:9e5 , 2e6 , repl = TRUE )
dates <- sample( 1:20 , 2e6 , repl = TRUE )
Date <- as.Date( Sys.Date() + dates )
DT <- data.table( Product = Product , Date = Date )
system.time( { setkeyv( DT , c( "Product" , "Date" ) ); DT[ , lapply( .SD , `[` , -c( 1:ceiling( .25 * .N ) ) ) , by = Product ] } )
# user system elapsed
# 14.65 0.03 14.74
data.table
!非常感谢 @Arun (现在是data.table
软件包的作者),我们现在有了使用data.table
的最佳方法,即使用{{1}通过使用.I
删除前25%的记录,然后使用这些行索引执行子集以获取最终表,这是所有行索引的整数向量,[
中的子集。这比使用上面的-(1:ceiling(.N*.25))
方法快4-5倍。太棒了!
.SD
答案 1 :(得分:2)
以下是使用mapply
和input
以及table_input
的方法:
#your code
#input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
#colnames(input) <- c( "Product" , "Something" ,"Date")
#input <- as.data.frame(input)
#input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
#Sort based on date, I want to leave out the entries with the oldest dates.
#input <- input[ with( input, order(Date)), ]
#Create number of items I want to select
#table_input <- as.data.frame(table(input$Product))
#table_input$twentyfive <- ceiling( table_input$Freq*0.25 )
#function to "mapply" on "table_input"
fun = function(p, d) { grep(p, input$Product)[1:d] }
#subset "input"
input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]
Product Something Date
1 1000001 100001 2011-01-01
3 1000001 100003 2011-01-01
7 1000002 100002 2011-01-01
11 1000003 100003 2011-01-01
我也呼吁system.time
和replicate
来比较mapply
的速度和SimonO101答案的替代方案:
#SimonO101's code
#require( plyr )
#ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
#install.packages( "data.table" , repos="http://r-forge.r-project.org" )
#require( data.table )
#DT <- data.table( input )
#setkeyv( DT , c( "Product" , "Date" ) )
#DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]
> system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
user system elapsed
5.29 0.00 5.29
> system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
user system elapsed
43.48 0.03 44.04
> system.time(replicate(10000, DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ] ))
user system elapsed
34.30 0.01 34.50
但是:SimonO101的替代方案 生成的内容与mapply
相同,因为我使用您发布的mapply
使用了table_input
;我不知道这在比较中是否起任何作用。此外,比较可能是我愚蠢的设置。我刚刚做了,因为你指出的速度问题。我真的希望@ SimonO101看到这个,以防我说废话。