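I am trying to draw a sample from a large data set with three columns. The sample must meet the following criteria:
Only a specific number of distinct letters may be used across the letter1 and letter2 columns. That is, the limit on the number of letters is checked as follows (where s is the solution data frame):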
NROW(unique(append(s$letter1,s$letter2)))
4
A minimal data set looks like this:
>df
letter1 letter2 value
a b 1
a c 1
a d 3
b a 1
c b 1
c d 2
c e 4
d a 5
d e 1
d c 2
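(The value in letter1 always differs from the value in letter2.)
For example, from the sample data set above I want a sample containing 1 row with a value of 4-5, 1 row with a value of 2-3, and 4 rows with a value of 1. I want only combinations of 4 different letters to appear in the first two columns. In this case there are two possible solutions (plus their permutations), both using the letters a, b, c, d. I only need one of them (or a permutation); I don't care which.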
>s1
letter1 letter2 value
a b 1
a c 1
a d 3
b a 1
c b 1
d a 5
>s2
letter1 letter2 value
a b 1
a c 1
b a 1
c b 1
c d 2
d a 5
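Both criteria can be checked mechanically. The following is a minimal sketch, not part of my original code: the helper name is_valid_sample is hypothetical, and it assumes the letter count must equal u exactly, as in the NROW(unique(...)) check above.

```r
# Hypothetical helper: TRUE when `s` meets both criteria.
# `groups` = required row counts for value 1, values 2-3, values 4-5.
# Assumption: the number of distinct letters must equal `u` exactly.
is_valid_sample <- function(s, groups, u) {
  n_letters <- length(unique(c(s$letter1, s$letter2)))
  counts <- c(sum(s$value == 1),
              sum(s$value %in% 2:3),
              sum(s$value %in% 4:5))
  n_letters == u && all(counts == groups)
}

# s1 as listed above
s1 <- data.frame(letter1 = c("a", "a", "a", "b", "c", "d"),
                 letter2 = c("b", "c", "d", "a", "b", "a"),
                 value   = c(1, 1, 3, 1, 1, 5),
                 stringsAsFactors = FALSE)

is_valid_sample(s1, groups = c(4, 1, 1), u = 4)
```

A check like this is useful for testing any faster approach against the slow one.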
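Solution that works for the minimal data frame but not for the real (larger) data
I have solved the problem for the minimal data frame above, but unfortunately it is far too slow for the actual data frame I need to process. The solution consists of:
The function below returns rows for the first set of u letters whose combinations satisfy the number of rows required in each group, as specified by groups. groups is a numeric vector of three elements giving the number of rows for each value group: groups[1] is the required number of rows with value 1, groups[2] is the required number of rows with values 2-3, and groups[3] is the required number of rows with values 4-5.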
library(trotter)
library(gtools)
library(dplyr)
obtainValues <- function (df,u,groups){
# Get unique letter values in the dataframe
lett <- unique(append(df$letter1,df$letter2))
# All possible combinations of letters in lett in groups of u letters
combos <- cpv(u,lett)
# Iterate over every possible group of u letters
for(i in seq_len(length(combos))){
l <- as.character(combos[i])
l <- as.data.frame(permutations(u,2,l),stringsAsFactors=FALSE)
names(l) <- c("letter1","letter2")
dc <- semi_join(df,l,by = c("letter1", "letter2"))
#Groups of values of each type with the current letters
g1 <- dc[dc$value == 1,]
g2 <- dc[dc$value > 1 & dc$value < 4,]
g3 <- dc[dc$value > 3,]
if(NROW(g1) >= groups[1] && NROW(g2) >= groups[2] && NROW(g3) >= groups[3]){
# I do not want more rows of each type than the requested ones
g1 <- g1[sample(nrow(g1),groups[1]),]
g2 <- g2[sample(nrow(g2),groups[2]),]
g3 <- g3[sample(nrow(g3),groups[3]),]
# Join chosen rows in a dataframe
g <- rbind(g1,g2)
g <- rbind(g,g3)
return(g)
}
}
}
df <- data.frame(c("a","a","a","b","c","c","c","d","d","d"),c("b","c","d","a","b","d","e","a","e","c"), c(1,1,3,1,1,2,4,5,1,2), stringsAsFactors=FALSE)
names(df) <- c("letter1","letter2","value")
groups <- c(4,1,1)
u <- 4
obtainValues(df,u,groups)
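Question
The actual data frame is too large to iterate over every candidate set of letters. I could split that range and run the for loop on one chunk of letter sets at a time, but because my data frame is very large this is still very slow. Is there another solution? Is there a way to avoid the for loop, or to adapt my solution to larger data frames?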
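To give a sense of scale: the loop visits one candidate set per combination of u letters, so the number of iterations is choose(n, u) for n distinct letters. This is a rough sketch only; the helper name n_candidates and the figure of 1000 distinct letters are hypothetical and not taken from my real data.

```r
# Number of candidate letter sets the loop has to visit: choose(n, u).
# `n_candidates` is a hypothetical helper name used for illustration.
n_candidates <- function(n_letters, u) choose(n_letters, u)

n_candidates(5, 4)     # minimal example: letters a-e, u = 4
n_candidates(1000, 4)  # hypothetical data set with 1000 distinct letters
```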
Answer 0 (score: 0)
I would first randomize the row order, then split on the value column (since you know the size of the draw within each group), and then pick the first 4 items (or 1, depending on the group), conditional on !duplicated( paste( letter1, letter2, sep="_"))
rdf1 <- df1[ sample(nrow(df1)) , ]
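# First attempt: findInterval() is called here without the value column or a breaks vector, so every row lands in a single group
vspl <- with( rdf1, split(rdf1, findInterval(0,1.5,3.5, 5.5) ) )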
#--------
> vspl
$`1`
letter1 letter2 value
3 a d 3
6 c d 2
7 c e 4
5 c b 1
2 a c 1
1 a b 1
10 d c 2
4 b a 1
8 d a 5
9 d e 1
#--------
vspl <- with( rdf1, split(rdf1, findInterval(rdf1$value, c(0,1.5,3.5, 5.5) ) ))
do.call( rbind,
lapply( names(vspl), # iterate over names so the group label can decide the number of rows
function(x) {
grp <- vspl[[x]] # work on the group named x
grp <- grp[ !duplicated( paste(grp$letter1, grp$letter2, sep="_")), ] # drop duplicated letter pairs
grp[ 1:(if (x=="1") 4 else 1), , drop=FALSE] # 4 rows for the value-1 group, 1 row otherwise
}) )
letter1 letter2 value
5 c b 1
2 a c 1
1 a b 1
4 b a 1
3 a d 3
7 c e 4
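The same shuffle, split, and take-first-n steps can be wrapped in a function. This is a minimal sketch rather than a definitive implementation: the name draw_by_group and its arguments are hypothetical, the breaks are the ones used above, and like the code above it makes no attempt to enforce the limit on the number of distinct letters.

```r
# Hypothetical wrapper around the shuffle / split / take-first-n steps.
# `n_per_group` = rows wanted for value 1, values 2-3, values 4-5.
# Note: the limit on distinct letters is NOT checked here.
draw_by_group <- function(df, n_per_group, breaks = c(0, 1.5, 3.5, 5.5)) {
  shuffled <- df[sample(nrow(df)), ]            # randomize row order
  bins <- findInterval(shuffled$value, breaks)  # 1, 2 or 3 per row
  parts <- lapply(seq_along(n_per_group), function(i) {
    grp <- shuffled[bins == i, , drop = FALSE]
    # drop duplicated letter pairs within the group
    grp <- grp[!duplicated(paste(grp$letter1, grp$letter2, sep = "_")), ,
               drop = FALSE]
    head(grp, n_per_group[i])                   # keep at most n rows
  })
  do.call(rbind, parts)
}

df <- data.frame(letter1 = c("a","a","a","b","c","c","c","d","d","d"),
                 letter2 = c("b","c","d","a","b","d","e","a","e","c"),
                 value   = c(1, 1, 3, 1, 1, 2, 4, 5, 1, 2),
                 stringsAsFactors = FALSE)

set.seed(1)  # only to make the draw repeatable
s <- draw_by_group(df, n_per_group = c(4, 1, 1))
table(findInterval(s$value, c(0, 1.5, 3.5, 5.5)))  # rows drawn per value group
```

Using head() instead of 1:n avoids NA rows when a group holds fewer rows than requested.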