df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))
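Within each id, I want to pick a random starting point and then select that row plus the rows that follow it, in order, until the selection totals 1% of the rows in that ID. Then I want to do the same for 2% of each ID's rows, then 3%, and so on, up to 99% of the rows per ID. Also, the random starting point must not be closer to the end of the ID's rows than the percentage being sampled (i.e., don't try to sample 20% of consecutive rows starting from a point that is 10% from the end of the ID's rows).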
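The result should look something like dfcombine from the first code block below, but instead of randomly selected fruit rows within each id, the fruit rows would only have a random starting point, and the remaining rows needed for the sample would follow the starting row sequentially.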
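To make the end-of-group constraint concrete, here is a minimal sketch of the arithmetic, assuming a group of n rows and a block of k consecutive rows. The helper name valid_starts is hypothetical and only used for illustration; the last row from which a full block still fits is n - k + 1.

```r
# Hypothetical helper: which rows may start a block of k consecutive rows
# inside a group of n rows without running past the group's last row?
valid_starts <- function(n, k) {
  stopifnot(k >= 1, k <= n)
  seq_len(n - k + 1)
}

n <- 10                       # rows in one id (as in the example df)
k <- 2                        # a 20% sample of 10 rows
starts <- valid_starts(n, k)  # rows 1 through 9
start  <- starts[sample.int(length(starts), 1)]  # one random start
block  <- start:(start + k - 1)                  # consecutive row positions
```

With 10 rows and a 20% sample, row 10 is excluded as a starting point because only one row would be left from there.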
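I can solve part of the problem with the following code, but it selects all rows at random, and I need the sample chunks to be sequential after a random starting point (FYI: if you run this, you'll see that your chunks start at 6% because this is a small dataset, so there are no rows for sample-per-id fractions below 6%):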
library(tidyverse)
set.seed(123) # pick same sample each time
dflist<-list() # make an empty list
for (i in 1:100) { # "do i a hundred times"
  i.2 <- i/100 # i.2 is i/100
  dflooped <- df %>% # new df
    group_by(id) %>% # group by id
    sample_frac(i.2, replace=TRUE) # within each id, take a random sample of fraction i.2
  dflist[[i]] <- dflooped
}
dflist # check
library(data.table)
dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df
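I can also select the sequentially larger chunks I'm looking for, but this doesn't let me start at a random point (it always starts from the beginning of the df):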
lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])
and combining it with dplyr's group_by throws an error I don't understand:
df2 <- df %>%
group_by(id) %>%
lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])
Error in match.fun(FUN) :
'seq(0.01, 1, 0.01)' is not a function, character or symbol
So I may have some of the pieces, but I'm having trouble putting them together. The solution may or may not include what I've done above. Thanks.
Answer 0 (score: 1)
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
Adding a more unique data element to the test data so the sampling can be checked:
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
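I question the statistical impact of only starting the random sample at a point where you won't "run out" of observations within the ID category.
If you do run out, wouldn't it be better to wrap back around to the top of the records within that ID category? That would give every part of a particular ID an equal chance of being the start of the sample, rather than limiting the start to the first 80% of the data when we want a 20% sample. Just a thought! That said, I built this the way you asked for it.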
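The wrap-around idea could be sketched roughly as follows. This is only an illustration of the suggestion, not what the function in this answer does; the name wraparound_block is hypothetical, and it returns row positions within one ID rather than the rows themselves.

```r
# Hypothetical variant: the block may wrap around to the top of the group.
wraparound_block <- function(n, k) {
  stopifnot(n >= 1, k >= 0, k <= n)
  start <- sample.int(n, 1)                  # any row can be the start
  ((start - 1 + seq_len(k) - 1) %% n) + 1    # wrap past row n back to row 1
}

set.seed(1)
idx <- wraparound_block(n = 10, k = 3)  # e.g. 9 10 1 when the start is row 9
```

The positions in idx could then be used to index the subset for one ID, for example p_df[idx, ].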
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
  # subset the data frame for the ID we're currently interested in
  p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
  # calculate number of rows we need in order to sample _% of the data within this ID
  nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
  # pick a single random start row between 1 and
  # (number of rows - number of rows to sample + 1), the last row from which
  # a full block still fits inside this ID
  start_samp_indx <- sample.int(nrow(p_df) - nrows_to_sample + 1, 1)
  # take the start row plus the following rows (nrows to sample minus 1) and return
  all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
  return(p_df[all_samp_indx,])
}
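Test the function with a single sample at one percentage (10% here). Repeating the same call several times is also a good way to confirm that the starting position is random.
# single test: give me 10% of the rows with 'A' in the 'id' field: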
random_start_seq_sample(df, 'id', 'A', 0.1)
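Set aside a unique list of all the potential values in the id field. Also set aside a vector of sample sizes, expressed as fractions (between 0 and 1).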
# capture all possible values in id field
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)
# initialize list:
combined_list <- list()
for(i in 1:length(possible_ids)) {
  #browser()
  print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
  combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}
# process results of for loop
combined_list
# number of rows in each df in our list
sapply(combined_list, nrow)
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
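The fractions above are turned into row counts with floor(), so very small fractions can round down to zero rows. A small sketch of that arithmetic, using illustrative fractions and the 10 rows per id of the example data (the variable names here are made up for the example):

```r
n_rows <- 10                          # rows per id in the example df
pct    <- c(0.01, 0.05, 0.25, 0.99)   # illustrative fractions
k_raw  <- floor(pct * n_rows)         # 0 0 2 9: small fractions round down to 0
k_safe <- pmax(1, k_raw)              # 1 1 2 9: force at least one row

# Why zero matters: with k = 0 the index start:(start + k - 1) counts backwards
bad <- 5:(5 + 0 - 1)                  # 5 4, i.e. two rows instead of none
```

This is why a guard against a zero row count is worth having whenever the index is built as start:(start + k - 1).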
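I'll leave what I originally wrote above, but in retrospect I think this version is actually closer to what you asked for.
This solution uses the same kind of function, but I use a nested for loop to do what you asked.
For each ID, it loops over every sample percentage from 1% to 99% and pulls that share of the ID's rows as one block of consecutive rows, starting from a random row.
Code: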
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
  # subset the data frame for the ID we're currently interested in
  p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
  # calculate number of rows we need in order to sample _% of the data within this ID
  nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
  # don't let us sample zero rows
  if(nrows_to_sample < 1) {
    nrows_to_sample <- 1
  }
  # pick a single random start row between 1 and
  # (number of rows - number of rows to sample + 1), the last row from which
  # a full block still fits inside this ID
  start_samp_indx <- sample.int(nrow(p_df) - nrows_to_sample + 1, 1)
  # take the start row plus the following rows (nrows to sample minus 1) and return
  all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
  return(p_df[all_samp_indx,])
}
# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
# now put this bad boy in a for loop -- put these in order of what IDs match what sequence
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:99 / 100)
# adding an expand grid
ids_sample <- expand.grid(possible_ids, sampleperc_sequence)
# initialize list:
combined_list <- list()
counter <- 1
for(i in 1:length(possible_ids)) {
  for(j in 1:length(sampleperc_sequence)) {
    print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
    combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])
    # manually keep track of counter
    counter <- counter + 1
  }
}
random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])
# process results of for loop
combined_list
# check size of first list element
combined_list[[1]] # A, 1% sample: floor() gives 0 rows, so the guard returns 1 record
# check thirtieth element
combined_list[[30]] # A, 30% sample is 3 records
# check size of the sixtieth list element
combined_list[[60]] # A, 60% sample is 6 records
sapply(combined_list, nrow) # number of rows in each df in our list
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
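Once everything is bound together, it is hard to tell which percentage each block came from. One possible way to keep track, sketched under the assumption that a per-block pct column is acceptable: drive the sampling from an expand.grid of id and percentage, and tag each block before binding. The helper sample_block and the pct column are hypothetical names introduced for this sketch, not part of the code above.

```r
set.seed(42)
df <- data.frame(id = rep(LETTERS, each = 10)[1:50],
                 fruit = sample(c("apple", "orange", "banana"), 50, TRUE),
                 stringsAsFactors = FALSE)

# Hypothetical compact sampler: k consecutive rows of one id, random start
sample_block <- function(d, id_value, pct) {
  g <- d[d$id == id_value, , drop = FALSE]
  k <- max(1, floor(pct * nrow(g)))          # at least one row
  start <- sample.int(nrow(g) - k + 1, 1)    # every valid start is possible
  out <- g[start:(start + k - 1), , drop = FALSE]
  out$pct <- pct                             # tag the block with its percentage
  out
}

# One row per (id, percentage) combination, then one sampled block per row
grid <- expand.grid(id = unique(df$id), pct = (1:99) / 100,
                    stringsAsFactors = FALSE)
blocks <- lapply(seq_len(nrow(grid)), function(r) {
  sample_block(df, grid$id[r], grid$pct[r])
})
dfcombined <- do.call(rbind, blocks)
```

Filtering dfcombined on id and pct then shows each block on its own, for example dfcombined[dfcombined$id == "A" & dfcombined$pct == 0.3, ].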