Question

I have been struggling to find a solution to this problem and hope the community can offer some inspiration.

I have a large data.table containing customer activity information, built as shown below:

library(data.table)
library(dplyr)

DF = as.data.table(NULL)
cust_index = as.data.table(seq(1000,10000,3)) # list of unique customers
colnames(cust_index) = "cust_id"

# create a list of all customer activity - each cust_id represents an active event

for (cust in cust_index$cust_id){
  each_cust = as.data.table(rep(cust, sample(1:17,1, replace=FALSE)))
  DF = bind_rows(DF, each_cust)
  }
rm(each_cust)
colnames(DF) = "cust_id"
setkey(DF, cust_id)

# add dummy data for activity
DF[, A:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
DF[, B:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]
DF[, C:= sample(x = c(0,1), size = nrow(DF), replace = TRUE)]

I would like to sample up to 4 observations per customer from DF.

So far I have used a function that samples the observations belonging to a single customer:

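sample.cust = function(x){
  # return every row when the customer has 4 or fewer observations
  if (nrow(x) <= 4) {
    cust_sample = x
  } else {
    # draw 4 distinct row positions from all rows of x
    cust_sample = x[sample(nrow(x), 4, replace=FALSE)]
  }
  return(cust_sample)
}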

...which is called from a for loop:

for (cust in cust_index$cust_id){
  cust.sample = train.data[.(cust), sample.cust(.SD)]
  train.sample = bind_rows(train.sample, cust.sample)
 }

...but the for loop above never terminates.

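I have tried various combinations of := and set() to achieve this, with no success so far. Any suggestions would be very welcome, as I thought this would have a fairly simple solution.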

Many thanks, 微米.

Answer 1

A solution was posted in a now-deleted answer; it indexes rows using data.table's special symbol .I:

DF[DF[,sample((.I), min(.N, 4), replace=FALSE), by=cust_id]$V1]

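This is useful, but it overlooks the case where a customer has only one row, so the vector being sampled has length 1. When sample() is given a single number n, it samples from 1:n rather than returning n, so a row belonging to a different customer can be picked. Wrapping the logic in a function called from within data.table gives the correct result: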

resamp = function(.N, .I){
  if(.N==1) .I else sample((.I), min(.N, 4))
}

DF[ DF[, resamp(.N, .I), by="cust_id"]$V1]
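
As an illustration of the length-1 issue and of one possible alternative, here is a minimal sketch. It assumes a small made-up table (`toy`, with a hypothetical `row_id` column added only so the picked rows can be traced) and swaps in `sample.int(.N, ...)` to sample positions within each group, which are then mapped back to row numbers through `.I`. This is not the approach from the answer above, just a variant that avoids the explicit `if`.

```r
library(data.table)

set.seed(42)  # only to make this illustration repeatable

# Hypothetical toy data: customer 1 has 3 rows, customer 2 has 6, customer 3 has 1
toy <- data.table(cust_id = rep(c(1L, 2L, 3L), times = c(3L, 6L, 1L)))
toy[, row_id := .I]  # row 10 is the only row of customer 3

# The pitfall: given a single number n, sample(n, 1) draws from 1:n, not from c(n)
draws <- replicate(200, sample(10L, 1))
range(draws)

# Alternative sketch: sample positions 1..N inside each group,
# then translate them to global row numbers with .I
idx <- toy[, .I[sample.int(.N, min(.N, 4L))], by = cust_id]$V1
picked <- toy[idx]

# At most 4 rows per customer; the single-row customer keeps its own row
picked[, .N, by = cust_id]
```

Because `sample.int(.N, k)` always draws from `1:.N`, a group with a single row can only yield position 1, so `.I[1]` is that group's own row. The same call should work on the full `DF` by replacing `toy` with `DF`.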

Conditionally sized sample from data.table
