Sampling by group without repetition using data.table

Time: 2018-08-07 05:31:56

Tags: r data.table

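I will use a hypothetical scenario to illustrate the problem. Here is a table of musicians and the instruments they play, and a second table with the composition of the band: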

library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)

band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)

To avoid arguments about who is best suited to which instrument, the band will be assembled by drawing lots. This is how I do it:

musicians[band.comp, on = 'instrument'][, sample(musician, n), by = instrument]

   instrument     V1
1:       bass   Paul
2:       bass   Chas
3:      drums   Andy
4:     guitar   Paul
5:     guitar George

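The problem: since some musicians play more than one instrument, the same person may be drawn more than once.

One could write a for loop that draws musicians for each successive instrument subset and then removes the drawn musicians from the rest of the table. But I would like suggestions on how to do this with data.table, mainly because the real-life problems I need to solve with this logic involve databases with thousands of rows, and also because I am trying to understand data.table syntax better.

For reference, I tried some tips from Andrew Brooks blog, but could not come up with a solution.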
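For comparison, the loop-based approach described in the question can be sketched as follows. This is a minimal illustration, not code from the original post: `draw_band` is a hypothetical helper name, and the retry loop at the end is an assumption added because a greedy draw can dead-end (for example, if John and Paul are drawn for bass and Ringo for drums, only George is left for guitar).

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)

# Hypothetical helper (not from the original post): draw one instrument at a
# time and exclude musicians who have already been picked.
draw_band <- function(musicians, band.comp) {
  picked <- character(0)
  out <- vector("list", nrow(band.comp))
  for (k in seq_len(nrow(band.comp))) {
    inst <- band.comp$instrument[k]
    size <- band.comp$n[k]
    # Candidates for this instrument who have not been drawn yet
    pool <- setdiff(musicians[instrument == inst, musician], picked)
    if (length(pool) < size) stop("Not enough musicians left for ", inst)
    # Sample by index so the result is always a subset of pool
    chosen <- pool[sample.int(length(pool), size)]
    picked <- c(picked, chosen)
    out[[k]] <- data.table(instrument = inst, musician = chosen)
  }
  rbindlist(out)
}

# Assumption: simply retry when the greedy draw runs out of candidates.
set.seed(42)
band <- NULL
while (is.null(band)) {
  band <- tryCatch(draw_band(musicians, band.comp), error = function(e) NULL)
}
band
```

Retrying after a dead end is the simplest way to recover, but it probably does not make every feasible band equally likely, so treat it as a baseline rather than a statistically careful sampler.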

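3 Answers:

Answer 0: (score: 3)

This could be a solution: first pick one instrument for each musician, then sample musicians within each instrument. However, once every musician has been assigned a single instrument, the requested sample size for an instrument may exceed the number of musicians assigned to it, in which case you will get an error (though with your real data this may not be a problem).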

musicians[, .(instrument = sample(instrument, 1)), by = musician][band.comp, on = 'instrument'][, sample(musician, n), by = instrument]

Answer 1: (score: 3)

You could expand the band composition into sum(band.comp$n) distinct positions and keep sampling until you find a feasible composition:

roles = musicians[, 
  CJ(posn = 1:band.comp[.BY, on=.(instrument), x.n], musician = musician)
, by=instrument]

set.seed(1)
while (TRUE){
  roles[sample(1:.N), keep := !duplicated(.SD, by="musician") & !duplicated(.SD, by=c("instrument", "posn"))][]
  if (sum(roles$keep) == sum(band.comp$n)) break
}

setorder(roles[keep == TRUE, !"keep"])[]

   instrument posn musician
1:       bass    1   Stuart
2:       bass    2     John
3:      drums    1     Andy
4:     guitar    1   George
5:     guitar    2     Paul

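You could probably use linear programming or a bipartite graph to answer whether a feasible composition exists, but it is unclear what "sampling" would mean in terms of a distribution over feasible compositions.

Answer 2: (score: 1)

A related post asks: Randomly draw rows from dataframe based on unique values and column values, and eddi's answer there fits this case well: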
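As a rough illustration of the feasibility question, the sketch below checks a Hall-type condition by brute force: a band without repeated musicians exists exactly when, for every subset of instruments, the number of distinct musicians who play at least one of them is at least the total number of positions those instruments require. `is_feasible` is a hypothetical helper, not part of the answer above, and the check is exponential in the number of instruments, so it only suits small compositions.

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)

# Hypothetical helper: TRUE if some assignment without repeated musicians exists.
is_feasible <- function(musicians, band.comp) {
  inst <- band.comp$instrument
  for (size in seq_along(inst)) {
    # Every subset of instruments of the given size
    for (s in combn(inst, size, simplify = FALSE)) {
      available <- uniqueN(musicians[instrument %in% s, musician])
      required <- band.comp[instrument %in% s, sum(n)]
      if (available < required) return(FALSE)
    }
  }
  TRUE
}

is_feasible(musicians, band.comp)
```

This only answers whether a feasible composition exists; it says nothing about how to sample one, which is the open point raised above.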

#keep number of musicians per instrument in 1 data.table
musicians[band.comp, n:=n, on=.(instrument)]

#for storing the musician that has been sampled so far
m <- c()

musicians[, {
    #exclude sampled musician before sampling
    res <- .SD[!musician %chin% m][sample(.N, n[1L])]
    m <- c(m, res$musician)
    res
}, by=.(instrument)]

Sample output:

   instrument musician n
1:       bass   Stuart 2
2:       bass     Chas 2
3:      drums     Paul 1
4:     guitar     John 2
5:     guitar    Ringo 2

Or, more concisely, with error handling:

m <- c()
musicians[
    band.comp, 
    on=.(instrument), 
    j={
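        s <- setdiff(musician, m)
        # i.n is band.comp's n for this group; plain n would pick up the
        # column added to musicians above, which is a vector
        if (length(s) < i.n) stop(paste("Not enough musicians playing", .BY))
        res <- sample(s, i.n)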
        m <- c(m, res)
        res
    }, 
    by=.EACHI]