Question

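I am trying to draw a sample from a large data set with three columns. The sample needs to meet the following criteria:

Get a specific number of rows for each value group. There are three value groups (1, 2-3, 4-5).
Use only a specific number of distinct letters in the letter1 and letter2 columns. That is, the sample must satisfy the following condition on the limited number of letters (where s is the solution data frame):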

NROW(unique(append(s$letter1,s$letter2))) <= 4

A minimal data set looks like this:

>df
letter1 letter2 value
a        b      1
a        c      1
a        d      3
b        a      1
c        b      1
c        d      2
c        e      4
d        a      5
d        e      1
d        c      2

(The value in letter1 is always different from the value in letter2.)

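For example, from the sample data set I want a sample containing 1 row with value 4-5, 1 row with value 2-3 and 4 rows with value 1. I want only combinations of 4 different letters to appear in the first two columns. In this case there are two possible solutions (plus their permutations), both using the letters a, b, c, d. I only need to obtain one of them (or a permutation); I do not care which.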

>s1
letter1 letter2 value       
a        b      1
a        c      1
a        d      3
b        a      1
c        b      1
d        a      5

>s2
letter1 letter2 value
a        b      1
a        c      1
b        a      1
c        b      1
c        d      2
d        a      5
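
Either candidate can be checked against both criteria. The following is a minimal base R sketch using the rows of s1 shown above; the names used_letters, n_letters and group_counts are illustrative and not part of the original code.

```r
# Candidate sample s1, copied from the example above
s <- data.frame(
  letter1 = c("a", "a", "a", "b", "c", "d"),
  letter2 = c("b", "c", "d", "a", "b", "a"),
  value   = c(1, 1, 3, 1, 1, 5),
  stringsAsFactors = FALSE
)

# Distinct letters used across both letter columns
used_letters <- unique(c(s$letter1, s$letter2))
n_letters <- length(used_letters)

# Number of rows per value group: (0,1] is "1", (1,3] is "2-3", (3,5] is "4-5"
group_counts <- table(cut(s$value, breaks = c(0, 1, 3, 5),
                          labels = c("1", "2-3", "4-5")))
```

Here n_letters can be compared with the allowed number of letters, and group_counts with the requested number of rows per group.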

A solution that works for the minimal data frame but not for the real (larger) data

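I have solved the problem for the minimal data frame above, but unfortunately the solution is far too slow for the real data frame I need to process. It consists of:

Get every possible combination of the required number of unique letters (u = 4 in this example).
For each of these combinations, take the permutations in groups of two (the ways the letters can appear in the letter1 and letter2 columns of the data frame).
Use semi_join() to keep the rows of the original data frame (df) that also appear in the permutation data frame (l).
Take the rows belonging to each value group and check whether there are enough of them to satisfy the number of rows required for each group.
If so, choose randomly among them where necessary to meet the requirement.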
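
To make the first three steps concrete, here is a rough base R sketch for a single combination. It is not the code used in the question: it swaps the permutation table and semi_join() for a plain %in% filter, which should select the same rows under the stated assumption that letter1 always differs from letter2. The names combos, keep and dc are illustrative.

```r
# Example data from the question
df <- data.frame(
  letter1 = c("a", "a", "a", "b", "c", "c", "c", "d", "d", "d"),
  letter2 = c("b", "c", "d", "a", "b", "d", "e", "a", "e", "c"),
  value   = c(1, 1, 3, 1, 1, 2, 4, 5, 1, 2),
  stringsAsFactors = FALSE
)
u <- 4

# Step 1: every combination of u letters, one combination per column
lett <- unique(c(df$letter1, df$letter2))
combos <- combn(lett, u)

# Steps 2-3 for the first combination only: keep the rows whose two letters
# both belong to the combination (assumes letter1 != letter2 in every row)
keep <- df$letter1 %in% combos[, 1] & df$letter2 %in% combos[, 1]
dc <- df[keep, ]
```

With the example data the first combination is a, b, c, d, so dc keeps every row that does not involve the letter e.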

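The function below returns rows for the first set of u letters whose combinations satisfy the required number of rows in each group. groups is a numeric vector of three elements giving the number of rows for each value group: groups[1] is the required number of rows with value 1, groups[2] the required number with value 2-3, and groups[3] the required number with value 4-5.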

library(trotter)
library(gtools)
library(dplyr)

obtainValues <- function (df,u,groups){

# Get unique letter values in the dataframe
lett <- unique(append(df$letter1,df$letter2))

# All possible combinations of letters in lett in groups of u letters
letters <- cpv(u,lett)

# Iterate over any possible group of 4 letters
for(i in 1:length(letters)){

    l <- as.character(letters[i])
    l <- as.data.frame(permutations(u,2,l),stringsAsFactors=FALSE)
    names(l) <- c("letter1","letter2")

    dc <- semi_join(df,l,by = c("letter1", "letter2"))


    #Groups of values of each type with the current letters
    g1 <- dc[dc$value == 1,]
    g2 <- dc[dc$value > 1 & dc$value < 4,]
    g3 <- dc[dc$value > 3,] 

    if(NROW(g1) >= groups[1] && NROW(g2) >= groups[2] && NROW(g3) >= groups[3]){

        # I do not want more rows of each type than the requested ones
        g1 <- g1[sample(nrow(g1),groups[1]),] 
        g2 <- g2[sample(nrow(g2),groups[2]),]
        g3 <- g3[sample(nrow(g3),groups[3]),]

        # Join chosen rows in a dataframe
        g <- rbind(g1,g2)
        g <- rbind(g,g3)
        return(g)
    }
 }
}


df <- data.frame(c("a","a","a","b","c","c","c","d","d","d"),c("b","c","d","a","b","d","e","a","e","c"), c(1,1,3,1,1,2,4,5,1,2), stringsAsFactors=FALSE)
names(df) <- c("letter1","letter2","value")

groups <- c(4,1,1)
u <- 4
obtainValues(df,u,groups)

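Question

The real data frame is too large to iterate over 1:length(letters). I could split that range and run the for loop over different chunks of letters at a time, but because my data frame is very large this is still very slow. Is there any other solution? Is there a way to avoid the for loop, or to adapt my solution somehow to larger data frames?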
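
As a rough illustration of why the loop becomes impractical, the number of combinations to visit is choose(n, u), where n is the number of distinct letters. The values of n below are hypothetical and are not taken from the real data.

```r
# Hypothetical numbers of distinct letters; u = 4 as in the example
n_distinct <- c(5, 26, 100)
u <- 4

# Number of u-letter combinations the loop would have to visit
n_combos <- choose(n_distinct, u)

# Ordered letter pairs generated for each combination
n_pairs <- u * (u - 1)

data.frame(n_distinct, n_combos)
```

Each iteration also builds a table of n_pairs ordered pairs and performs a join, so the total cost grows with both the number of combinations and the size of the data frame.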

Answer 1

I would first do a randomization of the row-orders, then a first split on the values column (since you know the size of the draws within that grouping), and then pick the first 4 items conditional on !duplicated( paste( letter1, letter2, sep="_"))

rdf1 <- df1[ sample(nrow(df1)) , ] 
# first attempt: the value column and c() are missing, so every row falls into one group
vspl <- with( rdf1, split(rdf1, findInterval(0,1.5,3.5, 5.5) )  )
#--------
> vspl
$`1`
   letter1 letter2 value
3        a       d     3
6        c       d     2
7        c       e     4
5        c       b     1
2        a       c     1
1        a       b     1
10       d       c     2
4        b       a     1
8        d       a     5
9        d       e     1
#--------
 vspl <- with( rdf1, split(rdf1, findInterval(rdf1$value, c(0,1.5,3.5, 5.5) ) ))
do.call( rbind, 
   lapply( names(vspl),   # names rather than values to later determine number of rows
     function(x) vspl[[x]][  # work on the nth item in the list
       !duplicated( paste(vspl[[ x]]$letter1,vspl[[x]]$letter2,sep="_")), ][ # drop dupes
            1:(if (x=="1") 4 else 1), ,drop=FALSE])  ) # select correct number
  letter1 letter2 value
5       c       b     1
2       a       c     1
1       a       b     1
4       b       a     1
3       a       d     3
7       c       e     4
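
For reference, a small sketch of how the findInterval() breakpoints used above map the five possible values onto the three groups. The vector values is illustrative only.

```r
# The five possible entries of the value column
values <- c(1, 2, 3, 4, 5)

# Same breakpoints as in the answer: [0, 1.5), [1.5, 3.5), [3.5, 5.5)
bin <- findInterval(values, c(0, 1.5, 3.5, 5.5))

# Group 1 holds value 1, group 2 holds values 2-3, group 3 holds values 4-5
split(values, bin)
```

This is why the list produced by split() has the names "1", "2" and "3", and why the lapply() call tests x == "1" to decide how many rows to keep.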
