Question

Sample data

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))

Problem

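Select a random starting point within each id, and from that point select that row plus the subsequent, sequential rows totaling 1% of the rows within that ID. Then do it again for 2% of each ID's rows, then 3%, and so on up to 99% of the rows per ID. Also, don't pick a random starting point that is closer to the end of the ID's rows than the percentage to be sampled (i.e., don't try to sample 20% of sequential rows from a point that is 10% from the end of the ID's rows).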

Desired result

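Something like what dfcombine turns out to be in the first code chunk below, except that instead of randomly selected fruit rows within ids, only the starting point is random, and the remaining rows needed for the sample follow the starting row sequentially.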

What I've tried

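I can solve part of the problem with the code below, but it selects all rows at random, and I need the sample chunks to be consecutive after a random starting point (FYI: if you run this, you'll see your chunks start at 6% b/c this is a small dataset, so a sample-per-id of <6% contains no rows):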

library(tidyverse)

set.seed(123) # pick same sample each time

dflist<-list() # make an empty list

for (i in 1:100) # "do i a hundred times"

{

  i.2<-i/100 # i.2 is i/100
  dflooped <- df %>% # new df
    group_by(id) %>% # group by id
    sample_frac(i.2,replace=TRUE)  # every i.2, take a random sample
  dflooped 
  dflist[[i]]<-dflooped 
}
dflist # check

library(data.table)

dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df

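I can also select the sequentially larger chunks I'm looking for, but this doesn't let me start at a random point (it always starts from the beginning of the df):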

lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])

And using dplyr's group_by spits out an error I don't understand:

df2 <- df %>%
  group_by(id) %>%
  lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])

Error in match.fun(FUN) : 
  'seq(0.01, 1, 0.01)' is not a function, character or symbol

So I may have some of the pieces, but I'm having trouble putting them together. The solution may or may not include what I've done above. Thanks.

Answer 1

Sequential sampling within ID

Create fake data

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)

Add a more unique data element to the test data so the sampling can be checked

df$random_numb <- round(runif(nrow(df), 1, 100), 2)

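Here we define a function that does what you want:

I question the statistical implications of only starting the random sample at points where you won't "run out" of observations within the ID category.

Wouldn't it be better to wrap back around to the top of the records within each ID category if you run out? That would give every part of a given ID an even chance of being the start of the sample, rather than limiting starts to the first 80% of the data when we want a 20% sample size. Just a thought! I built this the way you asked, though.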

random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {

    #browser()

    # subset the data frame for the ID we're currently interested in
    p_df <- p_df[  p_df[, p_idname] == p_idvalue,  ]


    # calculate number of rows we need in order to sample _% of the data within this ID
    # (at least 1, so that the index range below never runs backwards)
    nrows_to_sample <- max(1, floor(p_sampleperc * nrow(p_df)))


    # pick a single random start point between 1 and the last row from which
    # nrows_to_sample consecutive rows still fit inside this ID
    start_samp_indx <- sample.int(nrow(p_df) - nrows_to_sample + 1, 1)


    # sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
    all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
    return(p_df[all_samp_indx,])
}
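
The wrap-around idea raised above is not what the question asked for, but it could look roughly like the sketch below. The function name wraparound_seq_sample and the toy data frame are hypothetical and exist only for illustration; the sketch assumes the requested ID is present in the data and that the proportion is between 0 and 1.

```r
# Hypothetical variant: start anywhere in the ID and wrap around to the top
# of that ID's rows if the block runs past the end.
wraparound_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
    # subset the data frame for the ID we're currently interested in
    p_df <- p_df[p_df[[p_idname]] == p_idvalue, ]
    n <- nrow(p_df)

    # number of rows to take: at least 1, at most all rows in this ID
    nrows_to_sample <- min(n, max(1, floor(p_sampleperc * n)))

    # every row is an equally likely start point
    start_samp_indx <- sample.int(n, 1)

    # consecutive positions, wrapped back into 1..n with modular arithmetic
    offsets <- seq_len(nrows_to_sample) - 1
    all_samp_indx <- ((start_samp_indx - 1 + offsets) %% n) + 1
    p_df[all_samp_indx, ]
}

# toy data (an assumption, not the data from the question)
toy <- data.frame(id = rep(c("A", "B"), each = 10), value = 1:20,
                  stringsAsFactors = FALSE)

# 30% of the rows for ID "A"; the block may wrap from row 10 back to row 1
wraparound_seq_sample(toy, "id", "A", 0.3)
```

The trade-off is that a wrapped sample is no longer one unbroken run of rows, so this only makes sense if the row order within an ID can be treated as circular.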

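Test the function with a single call

Test the function with just one sample at a given percentage (10% here). Repeating the same call a few times is also a good way to confirm that the starting position is random.

# single test: give me 10% of the rows with 'A' in the 'id' field: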
random_start_seq_sample(df, 'id', 'A', 0.1)

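Now put the function in a for loop

Set aside a unique list of all potential values in the id field. Also set aside a vector of sample sizes, expressed as proportions (between 0 and 1).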

# capture all possible values in id field
possible_ids <- unique(df$id)

# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)  


# initialize list:
combined_list <- list()


for(i in 1:length(possible_ids)) {
    #browser()

    print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
    combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}

Process the results

# process results of for loop
combined_list

# number of rows in each df in our list
sapply(combined_list, nrow)

Here is the resulting dataset with all samples combined

# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
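
One way to check that each sample really is a single run of consecutive rows is to look at the row names, which row subsetting carries over from the original data frame. The helper below is a hypothetical sketch (is_consecutive_block and the toy data are not part of the answer) and assumes the original data frame has the default 1, 2, 3, ... row names.

```r
# Hypothetical helper: TRUE if the sampled rows form one unbroken run of
# consecutive rows from the original data frame. Relies on default row names
# (1, 2, 3, ...) being preserved by row subsetting.
is_consecutive_block <- function(samp) {
    orig_rows <- as.integer(rownames(samp))
    all(diff(orig_rows) == 1)
}

# toy data (an assumption, not the data from the question)
toy <- data.frame(id = rep(c("A", "B"), each = 5), value = 1:10,
                  stringsAsFactors = FALSE)

is_consecutive_block(toy[3:5, ])         # a contiguous block
is_consecutive_block(toy[c(1, 3, 4), ])  # a block with a gap
```

Applied to the list built in the loop, sapply(combined_list, is_consecutive_block) gives one TRUE or FALSE per sample.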

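Edit:

I'll leave what I originally wrote above, but in hindsight I think this version is actually closer to what you asked for.

This solution uses the same kind of function, but I use a nested for loop to do what you asked.

For each ID, it will:

Subset the data frame to this ID value
Find a random starting point
Sample n% of the data (starting at 1%)
Repeat, adding 1% to n each time (up to 99%)

Code: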

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)

# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)





# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {


    # subset the data frame for the ID we're currently interested in
    p_df <- p_df[  p_df[, p_idname] == p_idvalue,  ]


    # calculate number of rows we need in order to sample _% of the data within this ID
    nrows_to_sample <- floor(p_sampleperc * nrow(p_df))


    # don't let us use zero as an index
    if(nrows_to_sample < 1) {
        nrows_to_sample <- 1
    }


    # pick a single random start point between 1 and the last row from which
    # nrows_to_sample consecutive rows still fit inside this ID
    start_samp_indx <- sample.int(nrow(p_df) - nrows_to_sample + 1, 1)


    # sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
    all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
    return(p_df[all_samp_indx,])
}





# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)





# now put this bad boy in a for loop -- put these in order of what IDs match what sequence
    possible_ids <- unique(df$id)

    # these values need to be between 0 and 1 (10% == 0.1)
    sampleperc_sequence <- (1:99 / 100)  

    # adding an expand grid
    ids_sample <- expand.grid(possible_ids, sampleperc_sequence)



# initialize list:
combined_list <- list()
counter <- 1

for(i in 1:length(possible_ids)) {
    for(j in 1:length(sampleperc_sequence)) {
        print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
        combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])

        # manually keep track of counter
        counter <- counter + 1
    }


}


random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])


# process results of for loop
combined_list

    # check size of first list element
    combined_list[[1]]  # A, 1% sample is 1 record (the minimum of one row applies)


    # check thirtieth element
    combined_list[[30]] # A, 30% sample is 3 records


    # check size of the sixtieth list element
    combined_list[[60]] # A, 60% sample is 6 records





sapply(combined_list, nrow)  # number of rows in each df in our list


# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
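
The question's dfcombine also carried a column recording which percentage each row belongs to. A minimal sketch of how the same loop could add such a label is shown below; sample_block, the pct column, and the toy data are hypothetical names chosen for illustration, and split() is used here in place of the manual subsetting above.

```r
# Hypothetical helper: one consecutive block of roughly p * nrow(d) rows,
# starting at a random row from which the whole block still fits.
sample_block <- function(d, p) {
    n <- nrow(d)
    k <- min(n, max(1, floor(p * n)))   # rows to take, at least 1
    start <- sample.int(n - k + 1, 1)   # last valid start is n - k + 1
    d[start:(start + k - 1), ]
}

# toy data (an assumption, not the data from the question)
toy <- data.frame(id = rep(c("A", "B"), each = 10), value = 1:20,
                  stringsAsFactors = FALSE)

pieces <- list()
for (d in split(toy, toy$id)) {         # one data frame per ID
    for (pct in 1:99) {                 # 1% through 99%
        samp <- sample_block(d, pct / 100)
        samp$pct <- pct                 # label each row with its percentage
        pieces[[length(pieces) + 1]] <- samp
    }
}

# one long data frame: id, value, and the percentage the row was sampled under
all_samples <- do.call(rbind, pieces)
head(all_samples)
```

Looping over the integer percentages and dividing by 100 inside the call keeps the label column free of floating point noise such as 28.999999.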

Sampling progressively larger chunks of consecutive rows, with a random start per ID
