Sampling step by step

Date: 2019-07-02 08:03:15

Tags: r random dplyr tibble

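I am trying to simulate some data by sampling in several steps.

The first step (creating x) works fine.

In the second step, I want to create the variable y by sampling from a different vector depending on the value of x.

My code runs without errors, but it does not do what I want: only a single value is sampled for x == "A", and that value is then reused for every row where x == "A". I want a separate draw for each row where x == "A".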

Code:

library(tidyverse)
set.seed(1)

data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000, prob = c(0.1, 0.2, 0.7), replace = TRUE),
  y = case_when(
    x == "A" ~ sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)),
    x == "B" ~ sample(c("B1", "B2", "B3"), size = 1, prob = c(0.3, 0.4, 0.3)),
    x == "C" ~ sample(c("C1", "C2", "C3"), size = 1, prob = c(0.3, 0.4, 0.3)),
  ))

unique(data$x)
[1] "C" "A" "B"

unique(data$y)
[1] "C1" "A2" "B3"

If the code worked as intended, unique(data$y) would return something like [1] "A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3".

I know the problem is the size = 1 argument in sample(), but what can I replace it with? Removing it returns an error:

Error: `x == "A" ~ sample(c("A1", "A2", "A3"), prob = c(0.3, 0.4, 0.3))` must be length 100 or one, not 3

I have tried size = nrow(.data) and size = nrow(.), but these also return errors.

Is there a simple way to solve this?

3 answers:

Answer 0: (score: 2)

There may be a simpler way, but this stays close to your original code and works...

data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000, prob = c(0.1, 0.2, 0.7), replace = TRUE)) %>%
  # rowwise() makes each row its own group, so case_when() and sample()
  # are evaluated once per row instead of once for the whole column
  rowwise() %>%
  summarise(x = x,
            y = case_when(
              x == "A" ~ sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)),
              x == "B" ~ sample(c("B1", "B2", "B3"), size = 1, prob = c(0.3, 0.4, 0.3)),
              x == "C" ~ sample(c("C1", "C2", "C3"), size = 1, prob = c(0.3, 0.4, 0.3)),
            ))

Answer 1: (score: 1)

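This has to do with vectorization and recycling. In a vectorized call, sample(..., size = 1) is evaluated once and that single value is recycled across every matching element. If you do it with a loop, it works. For example: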

v1 <- c('A', 'A', 'B', 'B', 'C', 'C', 'C', 'A', 'A')

#Vectorized ifelse
ifelse(v1 == 'A', sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)), NA)
#[1] "A3" "A3" NA   NA   NA   NA   NA   "A3" "A3"

#Not vectorized if/else with a loop,
sapply(v1, function(i) if (i == 'A') { sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)) }else {NA})
#   A    A    B    B    C    C    C    A    A 
#"A2" "A3"   NA   NA   NA   NA   NA "A2" "A1" 
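The recycling can also be avoided without a loop by drawing as many values as there are elements. The following is only a minimal sketch of that idea; the names `choices`, `probs`, `one_draw`, `many_draws`, `recycled` and `per_row` are made up for illustration and do not appear in the question.

```r
set.seed(1)
v1 <- c("A", "A", "B", "B", "C", "C", "C", "A", "A")
choices <- c("A1", "A2", "A3")   # illustrative name
probs   <- c(0.3, 0.4, 0.3)      # illustrative name

# One draw: a length-1 vector that ifelse() recycles to every "A" position.
one_draw <- sample(choices, size = 1, prob = probs)
recycled <- ifelse(v1 == "A", one_draw, NA)

# One draw per element: a vector as long as v1, so nothing is recycled.
many_draws <- sample(choices, size = length(v1), prob = probs, replace = TRUE)
per_row <- ifelse(v1 == "A", many_draws, NA)

# The recycled version can only ever hold one distinct value for "A".
length(unique(recycled[v1 == "A"]))
```

The second version draws more values than are finally used (the draws at non-"A" positions are discarded), which is the price of staying vectorized.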

Answer 2: (score: 1)

It is easier to understand if you split it into multiple steps:

library(dplyr)
data <- tibble(
   x = sample(c("A", "B", "C"), size = 10000, 
                prob = c(0.1, 0.2, 0.7), replace = TRUE))

data <- data %>%
  mutate(y = case_when(
     x == "A" ~ sample(c("A1", "A2", "A3"), size = n(), 
               prob = c(0.3, 0.4, 0.3), replace = TRUE),
     x == "B" ~ sample(c("B1", "B2", "B3"), size = n(), 
                 prob = c(0.3, 0.4, 0.3), replace = TRUE),
     x == "C" ~ sample(c("C1", "C2", "C3"), size = n(), 
                prob = c(0.3, 0.4, 0.3), replace = TRUE),
)) 

unique(data$y)
#[1] "C2" "B3" "A1" "C3" "B1" "C1" "B2" "A3" "A2"

Or, if you want to do it in a single step, you need to specify the same size as the one given for x, along with replace = TRUE:

data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000, 
            prob = c(0.1, 0.2, 0.7), replace = TRUE),
  y = case_when(
    x == "A" ~ sample(c("A1", "A2", "A3"), size = 10000, 
                  prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "B" ~ sample(c("B1", "B2", "B3"), size = 10000, 
                  prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "C" ~ sample(c("C1", "C2", "C3"), size = 10000, 
                  prob = c(0.3, 0.4, 0.3), replace = TRUE),
  ))
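Both versions above draw a full-length vector for every branch of case_when() and keep only the matching positions. As a possible alternative (a sketch that is not part of the original answers), the draws can be made per group with group_by(), so that each group samples exactly n() values. It assumes the labels follow the "letter plus 1 to 3" pattern from the question; `probs` is an illustrative name.

```r
library(dplyr)
set.seed(1)

probs <- c(0.3, 0.4, 0.3)  # illustrative name; same weights as in the question

data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000,
             prob = c(0.1, 0.2, 0.7), replace = TRUE)) %>%
  group_by(x) %>%
  # Inside a grouped mutate(), n() is the size of the current group and
  # first(x) is its letter, so paste0() builds e.g. c("A1", "A2", "A3").
  mutate(y = sample(paste0(first(x), 1:3), size = n(),
                    prob = probs, replace = TRUE)) %>%
  ungroup()

sort(unique(data$y))
```

This relies on the labels being derivable from x; if the candidate vectors were unrelated to x, a lookup list keyed by x would be needed instead.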