Sample progressively larger blocks of consecutive rows, with a random start per ID

Time: 2017-03-15 03:10:35

Tags: r loops random dplyr

Example data

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))

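Problem

Within each id, pick a random starting point, then select that row and the rows that follow it consecutively, totaling 1% of the rows in that ID. Then do the same for 2% and 3% of each ID's rows, and so on, up to 99% of the rows per ID. Also, do not pick a starting point that is closer to the end of the ID's rows than the percentage to be sampled (i.e., don't try to sample 20% of consecutive rows from a point that is 10% from the end of the ID's rows).

Desired result

Something like what dfcombine from the first code block below contains, except that instead of randomly selected fruit rows within each id, only the starting point is random, and the remaining rows needed for the sample follow the starting row in order.

What I've tried

I can solve part of the problem with the code below, but it picks all of the rows at random, and I need the sample blocks to be consecutive after a random starting point (FYI: if you run it, you'll see your blocks start at 6% because this is a small dataset; there are no rows for anything < 6% of sample-per-id):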

library(tidyverse)

set.seed(123) # pick same sample each time

dflist<-list() # make an empty list

for (i in 1:100) # "do i a hundred times"

{

  i.2<-i/100 # i.2 is i/100
  dflooped <- df %>% # new df
    group_by(id) %>% # group by id
    sample_frac(i.2,replace=TRUE)  # every i.2, take a random sample
  dflooped 
  dflist[[i]]<-dflooped 
}
dflist # check

library(data.table)

dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df

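I can also select the progressively larger blocks I'm looking for, but this doesn't let me start at a random point (it always starts from the beginning of the df):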

lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])

And using dplyr group_by spits out an error I don't understand:

df2 <- df %>%
  group_by(id) %>%
  lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])

Error in match.fun(FUN) : 
  'seq(0.01, 1, 0.01)' is not a function, character or symbol

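So I may have some of the pieces, but I'm having trouble putting them together; a solution may or may not include what I've done above. Thanks.

1 Answer:

Answer 0: (score: 1)

Sequential sampling within ID

Create fake data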

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)

Add a more unique data element to the test data so the sampling can be checked

df$random_numb <- round(runif(nrow(df), 1, 100), 2)

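Here we'll define a function to do what you want:

I question the statistical implications of only starting the random sample at points where you won't "run out" of observations within the ID category.

If you do run out, wouldn't it be better to wrap back to the top of the records within each ID category? That would give an even chance of starting the sample anywhere within a particular ID, rather than being limited to the first 80% of the data when we want a 20% sample. Just a thought! I built this the way you asked for it!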

random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {

    #browser()

    # subset the data frame for the ID we're currently interested in
    p_df <- p_df[  p_df[, p_idname] == p_idvalue,  ]


    # calculate number of rows we need in order to sample _% of the data within this ID
    nrows_to_sample <- floor(p_sampleperc * nrow(p_df))


    # calculate a single random number to serve as our start point somewhere between:
        # 1 and the (number of rows - (number of rows to sample + 1))  --  the plus 1 
        # is to add a cushion and avoid issues
    start_samp_indx <- as.integer(runif(1,  1, (nrow(p_df) - (nrows_to_sample + 1)  )))


    # sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
    all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
    return(p_df[all_samp_indx,])
}
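The wrap-around idea raised above could be sketched roughly as follows. This is not part of what was asked for; `wrap_seq_sample` and the `row_in_id` helper column are hypothetical names introduced only for illustration, and the sketch assumes at least one row is always returned.

```r
# Hypothetical helper: take a consecutive block that wraps back to the top
# of the ID when it runs past the last row.
wrap_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
    # keep only the rows for the requested ID
    sub_df <- p_df[p_df[[p_idname]] == p_idvalue, , drop = FALSE]
    n <- nrow(sub_df)

    # number of rows to return (assumption: never fewer than 1)
    k <- max(1L, as.integer(floor(p_sampleperc * n)))

    # any row may be the start, each with equal probability
    start <- sample.int(n, 1)

    # offsets taken modulo n wrap past the last row back to row 1
    idx <- ((start - 1L + seq_len(k) - 1L) %% n) + 1L
    sub_df[idx, , drop = FALSE]
}

set.seed(42)
df <- data.frame(id = rep(LETTERS, each = 10)[1:50],
                 fruit = sample(c("apple", "orange", "banana"), 50, TRUE),
                 stringsAsFactors = FALSE)
df$row_in_id <- rep(1:10, times = 5)  # position within each ID, for checking

wrapped <- wrap_seq_sample(df, "id", "A", 0.4)
wrapped$row_in_id
```

With this variant every row of the ID is an equally likely start, at the cost of blocks that are sometimes split across the end and the beginning of the ID.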

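Test the function with a single function call

Test the function with a single sample at a given percentage (10% here). Re-running the same call several times is also a good way to confirm that the starting position really is random.

# single test: give me 10% of the rows with 'A' in the 'id' field: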
random_start_seq_sample(df, 'id', 'A', 0.1)
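One way to do the "run it several times" check without eyeballing the output is to repeat the call and tabulate where each block starts. The sketch below is self-contained, so it redefines a compact stand-in called `seq_block` (a hypothetical name) rather than reusing the function above. It draws the start uniformly from every position that still leaves room for the block, which is an assumption and may differ slightly from the function above.

```r
set.seed(1)
df <- data.frame(id = rep(LETTERS, each = 10)[1:50], stringsAsFactors = FALSE)
df$row_in_id <- rep(1:10, times = 5)  # position within each ID

# compact stand-in for the sampling function (hypothetical name)
seq_block <- function(p_df, p_idvalue, p_sampleperc) {
    sub_df <- p_df[p_df$id == p_idvalue, , drop = FALSE]
    k <- max(1L, as.integer(floor(p_sampleperc * nrow(sub_df))))
    # valid starts are 1 .. (n - k + 1), so the block always fits
    start <- sample.int(nrow(sub_df) - k + 1L, 1)
    sub_df[start:(start + k - 1L), , drop = FALSE]
}

# draw 200 samples of 30% from ID 'A' and record where each block starts
starts <- replicate(200, seq_block(df, "A", 0.3)$row_in_id[1])
table(starts)
```

If the start is random, the table should show counts spread over several start rows rather than piled on a single one.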

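Now put the function in a for loop

Set aside a unique list of all possible values in the id field. Also set aside a vector of sample sizes, expressed as proportions (between 0 and 1).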

# capture all possible values in id field
possible_ids <- unique(df$id)

# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)  


# initialize list:
combined_list <- list()


for(i in 1:length(possible_ids)) {
    #browser()

    print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
    combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}

Process the results

# process results of for loop
combined_list

# number of rows in each df in our list
sapply(combined_list, nrow)  

Here is the resulting dataset with all of the samples combined

# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)

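Edit:

I'll leave what I originally wrote up there, but in retrospect I think this is actually closer to what you asked for.

This solution uses the same kind of function, but I use nested for loops to do what you asked.

For each ID, it will:

  • Subset the data frame to this ID value
  • Find a random starting point
  • Sample n% of the data (starting at 1%)
  • Repeat with n + 1% (up to 99%)

Code: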

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)

# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)





# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {


    # subset the data frame for the ID we're currently interested in
    p_df <- p_df[  p_df[, p_idname] == p_idvalue,  ]


    # calculate number of rows we need in order to sample _% of the data within this ID
    nrows_to_sample <- floor(p_sampleperc * nrow(p_df))


    # don't let us use zero as an index
    if(nrows_to_sample < 1) {
        nrows_to_sample <- 1
    }


    # pick a single random start point between 1 and
        # (number of rows - number of rows to sample + 1), so the block
        # always fits inside this ID and every valid start is equally likely
    start_samp_indx <- sample.int(nrow(p_df) - nrows_to_sample + 1, 1)


    # sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
    all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
    return(p_df[all_samp_indx,])
}





# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)





# now put this bad boy in a for loop -- put these in order of what IDs match what sequence
    possible_ids <- unique(df$id)

    # these values need to be between 0 and 1 (10% == 0.1)
    sampleperc_sequence <- (1:99 / 100)  

    # adding an expand grid
    ids_sample <- expand.grid(possible_ids, sampleperc_sequence)



# initialize list:
combined_list <- list()
counter <- 1

for(i in 1:length(possible_ids)) {
    for(j in 1:length(sampleperc_sequence)) {
        print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
        combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])

        # manually keep track of counter
        counter <- counter + 1
    }


}


random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])


# process results of for loop
combined_list

    # check size of first list element
    combined_list[[1]]  # A, 1% sample is 1 record (the minimum)


    # check thirtieth element
    combined_list[[30]] # A, 30% sample is 3 records


    # check size of the sixtieth list element
    combined_list[[60]] # A, 60% sample is 6 records





sapply(combined_list, nrow)  # number of rows in each df in our list


# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
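The combined data frame above does not record which percentage each block came from, whereas the dfcombine in the question carried an "id%" column. A minimal sketch of one way to keep that label is shown below. It is self-contained, so it uses a compact stand-in sampler, `seq_block`, and a result name, `dfcombined_labeled`, both of which are hypothetical names introduced here and not part of the code above.

```r
set.seed(123)
df <- data.frame(id = rep(LETTERS, each = 10)[1:50],
                 fruit = sample(c("apple", "orange", "banana"), 50, TRUE),
                 stringsAsFactors = FALSE)

# compact stand-in sampler (hypothetical name): consecutive block of p * n rows
seq_block <- function(sub_df, p) {
    n <- nrow(sub_df)
    k <- max(1L, as.integer(floor(p * n)))   # at least one row
    start <- sample.int(n - k + 1L, 1)       # block always fits inside the ID
    sub_df[start:(start + k - 1L), , drop = FALSE]
}

# one row per (id, percentage) combination
grid <- expand.grid(id = unique(df$id), pct = 1:99, stringsAsFactors = FALSE)

samples <- Map(function(id, pct) {
    block <- seq_block(df[df$id == id, , drop = FALSE], pct / 100)
    block$pct <- pct   # label the block with the percentage it represents
    block
}, grid$id, grid$pct)

dfcombined_labeled <- do.call(rbind, unname(samples))
rownames(dfcombined_labeled) <- NULL
head(dfcombined_labeled)
```

Using `expand.grid` with `Map` replaces the nested loops and the manual counter; the `pct` column then makes it easy to filter or count rows per id and percentage afterwards.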