Question

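Let's say I want to run a linear regression model several times on different samples of the mtcars dataset. The idea is that, on each iteration of a for loop, I store the result of predict() from the regression fitted on that iteration's sample. A small example follows: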

## Fit the model once on a sample, then predict for the full dataset:
Sample_Size <- 10
Sample <- mtcars[sample(nrow(mtcars), Sample_Size), ]
Model <- lm(formula = mpg ~ wt, data = Sample)
Predictions <- predict(Model, newdata = mtcars)
## Gives a named vector with the predicted mpg for each car;
## t() turns it into a one-row matrix:
Predictions <- t(Predictions)

This produces:

> Predictions
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
[1,]  25.80494      23.89161   28.05592       21.34051          19.65228
       Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
 [1,] 19.50221   18.67685  21.52809 21.82822 19.65228  19.65228   14.92523
     Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
 [1,]   17.47633    17.10117           6.071394            4.765828

 .... and so on for other cars

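I want to repeat this process several times inside a for loop, drawing a different sample each time, obtain the corresponding Predictions, and store all the Predictions results row by row in a data frame.

Suppose I run the model for two different samples. Each row of the resulting data frame should be the result above for that sample, for example: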

     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
 [1,]  25.80494      23.89161   28.05592       21.34051          19.65228
 [2,]  22.80492      22.89147   28.05532       21.34231          20.65290
       Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
 [1,] 19.50221   18.67685  21.52809 21.82822 19.65228  19.65228   14.92523
 [2,] 21.83492   23.84147  29.02532 21.34231 20.35290  18.45228   13.92523

 ... and so on for other cars.

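Any ideas on how to do this? I've put something together, but it either throws an error or stores only the last result... What am I missing here?

Here is what I have so far: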

### Inside a for loop, to get a dataframe of Predictions:

Bootstrap_times <- 2
Sample_Size <- 10
Predictions <- list()
Results <-vector ("list",Bootstrap_times)## Stores the Predictions for each run

for(i in 1:Bootstrap_times){
### Take a sample
Sample[[i]] <- mtcars[sample(nrow(mtcars), Sample_Size), ]
### Do the regression on the sample
Model[[i]] <- lm(formula = mpg ~ wt, data = Sample[[i]])
### Perform the predict() on the sample
Predictions[[i]] <- predict(Model[[i]],newdata=mtcars)
### put the result as a line on the dataframe Results
Predictions[[i]] <- t(Predictions[[i]])
return(Predictions)
}

However, I keep getting:

Error in `[[<-.data.frame`(`*tmp*`, i, value = list(mpg = c(13.3, 10.4, : replacement has 10 rows, data has 0

Answer 1

I prefer to use magic_for(), but you could just as easily do this with base R.

Here is an example:

Bootstrap_times <- 2
Sample_Size     <- 10

Sample      <- mtcars[sample(nrow(mtcars), Sample_Size), ]
Model       <- lm(formula = mpg ~ wt, data = Sample)
Predictions <- predict(Model,newdata=mtcars)
## You like how I line up arrows, right?
Predictions <- t(Predictions)


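## Start from empty lists so Sample[[i]] and Model[[i]] hold one element per run
Sample      <- list()
Model       <- list()
Predictions <- list()
Results     <- vector("list", Bootstrap_times) ## Stores the Predictions for each run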

magicfor::magic_for()
for(i in 1:Bootstrap_times){
  ### Take a sample
  Sample[[i]] <- mtcars[sample(nrow(mtcars), Sample_Size), ]
  ### Do the regression on the sample
  Model[[i]] <- lm(formula = mpg ~ wt, data = Sample[[i]])
  ### Perform the predict() on the sample

  put(predict(Model[[i]],newdata=mtcars))
}

tmp<-magicfor::magic_result_as_dataframe()

tmp

   i predict(Model[[i]],newdata=mtcars)
1  1                          22.858806
2  2                          20.922763
3  1                          25.136504
4  2                          18.341372
5  1                          16.633098
6  2                          16.481252
7  1                          15.646096
8  2                          18.531180
9  1                          18.834873
10 2                          16.633098
11 1                          16.633098
12 2                          11.849933
13 1                          14.431324
14 2                          14.051708
15 1                           2.890988
16 2                           1.569924
17 1                           2.169717
18 2                          26.047583
19 1                          30.489093
20 2                          28.818782
21 1                          24.035616
22 2                          16.025712
23 1                          16.671060
24 2                          13.596168
25 1                          13.558206
26 2                          28.059549
27 1                          26.503122
28 2                          31.263511
29 1                          18.683026
30 2                          21.719957
31 1                          15.646096
32 2                          21.644034
33 1                          22.978374
34 2                          21.584264
35 1                          24.618503
36 2                          19.725450
37 1                          18.495353
38 2                          18.386011
39 1                          17.784630
40 2                          19.862128
41 1                          20.080812
42 2                          18.495353
43 1                          18.495353
44 2                          15.051081
45 1                          16.909894
46 2                          16.636540
47 1                           8.599905
48 2                           7.648629
49 1                           8.080530
50 2                          25.274555
51 1                          28.472808
52 2                          27.270046
53 1                          23.825774
54 2                          18.057985
55 1                          18.522689
56 2                          16.308514
57 1                          16.281178
58 2                          26.723336
59 1                          25.602581
60 2                          29.030452
61 1                          19.971470
62 2                          22.158309
63 1                          17.784630
64 2                          22.103638
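
For comparison, the base R route mentioned above could look roughly like the sketch below. It is only a minimal illustration, not the asker's or this answer's exact code: the seed is an assumption added for reproducibility, and do.call(rbind, ...) is one of several ways to stack the per-run vectors into the one-row-per-sample layout the question asks for.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

Bootstrap_times <- 2
Sample_Size     <- 10

# Pre-allocate one list slot per run
Predictions <- vector("list", Bootstrap_times)

for (i in seq_len(Bootstrap_times)) {
  # Draw a sample of rows, fit on it, then predict for every car in mtcars
  Sample <- mtcars[sample(nrow(mtcars), Sample_Size), ]
  Model  <- lm(mpg ~ wt, data = Sample)
  Predictions[[i]] <- predict(Model, newdata = mtcars)
}

# Stack the named vectors: one row per run, one column per car
Results <- as.data.frame(do.call(rbind, Predictions))
dim(Results)
```

Because each element of Predictions is a named vector, the car names carry over as column names, so Results has Bootstrap_times rows and one column per row of mtcars.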

Answer 2

My version:

# load data
data(mtcars)
N <- nrow(mtcars)

# bootstrap parameters
sample_size <- 10
bootstrap_times <- 20

# create empty storage matrix of results
# one row per bootstrap sample, one column per car (predicted mpg)
res_mat <- matrix(NA, nrow=bootstrap_times, ncol=N)
colnames(res_mat) <- rownames(mtcars)

# do bootstrap
for (i in seq(bootstrap_times)) {
    this_sample <- sample(N, sample_size, replace=FALSE)
    reg_result  <- lm(mpg ~ wt, data=mtcars[this_sample,])
    res_mat[i,] <- predict(reg_result, mtcars)
}
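
Once the matrix is filled, it can be converted to a data frame or summarised per car. The sketch below repeats the loop so that it runs on its own; the seed, the summary column names, and the choice of 2.5% and 97.5% quantiles are assumptions made for illustration, not part of the answer above.

```r
set.seed(42)  # assumption: arbitrary seed, for reproducibility only

N               <- nrow(mtcars)
sample_size     <- 10
bootstrap_times <- 20

# Same storage layout as above: one row per run, one column per car
res_mat <- matrix(NA_real_, nrow = bootstrap_times, ncol = N,
                  dimnames = list(NULL, rownames(mtcars)))

for (i in seq_len(bootstrap_times)) {
  this_sample  <- sample(N, sample_size, replace = FALSE)
  reg_result   <- lm(mpg ~ wt, data = mtcars[this_sample, ])
  res_mat[i, ] <- predict(reg_result, mtcars)
}

# Data frame form, as requested in the question
res_df <- as.data.frame(res_mat)

# Per-car summary of the predictions across all runs
summary_df <- data.frame(
  mean_pred = colMeans(res_mat),
  sd_pred   = apply(res_mat, 2, sd),
  lower     = apply(res_mat, 2, quantile, probs = 0.025),
  upper     = apply(res_mat, 2, quantile, probs = 0.975)
)
head(summary_df)
```

Keeping the results in a numeric matrix makes column-wise summaries like these straightforward, and the conversion to a data frame is a single call when needed.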

Answer 3

Here is a tidyverse approach using nested data.frames:

library(tidyverse)

Bootstrap_times <- 2
Sample_Size <- 10

Predictions <- data.frame(SampleID = 1:Bootstrap_times) %>%
  group_by(SampleID) %>%
  nest() %>%
  mutate(data = data %>% map(~mtcars[sample(nrow(mtcars), Sample_Size), ]),
         Model = data %>% map(~lm(formula = mpg ~ wt, data = .)),
         Predictions = map2(Model, data, ~predict(.x, newdata = .y))) %>%
  select(SampleID, Predictions) %>%
  unnest(Predictions)

Result:

# A tibble: 20 x 2
   SampleID Predictions
      <int>       <dbl>
 1        1        22.7
 2        1        16.2
 3        1        19.7
 4        1        21.5
 5        1        18.7
 6        1        17.4
 7        1        23.3
 8        1        10.7
 9        1        18.8
10        1        19.8
11        2        11.4
12        2        19.6
13        2        11.7
14        2        18.1
15        2        21.1
16        2        18.6
17        2        16.2
18        2        23.5
19        2        19.7
20        2        20.7

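The advantage of this approach is that it is very easy to extract other information from the models (using broom) and combine it into a single data.frame output: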

library(broom)

data.frame(SampleID = 1:Bootstrap_times) %>%
  group_by(SampleID) %>%
  nest() %>%
  mutate(data = data %>% map(~mtcars[sample(nrow(mtcars), Sample_Size), ]),
         Model = data %>% map(~lm(formula = mpg ~ wt, data = .) %>% augment())) %>%
  select(-data) %>%
  unnest(Model)

Result:

# A tibble: 20 x 11
   SampleID .rownames            mpg    wt .fitted .se.fit .resid  .hat .sigma  .cooksd .std.resid
      <int> <chr>              <dbl> <dbl>   <dbl>   <dbl>  <dbl> <dbl>  <dbl>    <dbl>      <dbl>
 1        1 Dodge Challenger    15.5  3.52   17.2    0.689 -1.72  0.106   2.15 0.0442      -0.862 
 2        1 Datsun 710          22.8  2.32   23.5    0.940 -0.655 0.198   2.24 0.0148      -0.346 
 3        1 Cadillac Fleetwood  10.4  5.25    8.24   1.52   2.16  0.515   1.93 1.15         1.47  
 4        1 Merc 450SE          16.4  4.07   14.4    0.863  2.04  0.167   2.10 0.112        1.06  
 5        1 Ford Pantera L      15.8  3.17   19.0    0.672 -3.24  0.101   1.85 0.147       -1.62  
 6        1 Lotus Europa        30.4  1.51   27.6    1.39   2.75  0.432   1.79 1.14         1.73  
 7        1 Volvo 142E          21.4  2.78   21.1    0.751  0.334 0.126   2.26 0.00207      0.169 
 8        1 Merc 280C           17.8  3.44   17.6    0.678  0.163 0.103   2.26 0.000378     0.0812
 9        1 Mazda RX4 Wag       21    2.88   20.6    0.724  0.428 0.117   2.25 0.00308      0.215 
10        1 Camaro Z28          13.3  3.84   15.6    0.773 -2.26  0.134   2.06 0.102       -1.15  
11        2 Merc 280            19.2  3.44   19.7    1.09  -0.470 0.108   3.53 0.00138     -0.151 
12        2 Toyota Corolla      33.9  1.84   28.2    1.65   5.66  0.251   2.52 0.658        1.98  
13        2 Hornet Sportabout   18.7  3.44   19.7    1.09  -0.970 0.108   3.51 0.00588     -0.311 
14        2 Mazda RX4 Wag       21    2.88   22.7    1.07  -1.69  0.106   3.47 0.0173      -0.540 
15        2 Chrysler Imperial   14.7  5.34    9.50   2.42   5.20  0.539   2.02 3.15         2.32  
16        2 Camaro Z28          13.3  3.84   17.5    1.26  -4.23  0.145   3.08 0.163       -1.39  
17        2 Valiant             18.1  3.46   19.6    1.09  -1.46  0.110   3.48 0.0136      -0.469 
18        2 Porsche 914-2       26    2.14   26.6    1.43  -0.611 0.188   3.52 0.00490     -0.205 
19        2 Merc 280C           17.8  3.44   19.7    1.09  -1.87  0.108   3.45 0.0219      -0.600 
20        2 Lotus Europa        30.4  1.51   30.0    1.91   0.441 0.335   3.52 0.00677      0.164

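Notes:

With this approach you don't even need the predict step (unless you are working with new data), because the fitted values are already in the .fitted column returned by augment().

Because no seed was set, the predictions differ between the first and second outputs.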
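
Both notes can be checked in base R without the tidyverse. The following is a small sketch under stated assumptions: the seed value 123 and the names Sample_A and Sample_B are made up for illustration.

```r
Sample_Size <- 10

# Resetting the same seed before each draw yields the same sample
set.seed(123)  # assumption: any fixed integer would do
Sample_A <- mtcars[sample(nrow(mtcars), Sample_Size), ]

set.seed(123)
Sample_B <- mtcars[sample(nrow(mtcars), Sample_Size), ]

same_sample <- identical(Sample_A, Sample_B)

# On the training data, predict() without newdata matches the fitted values
Model <- lm(mpg ~ wt, data = Sample_A)
same_fit <- isTRUE(all.equal(fitted(Model), predict(Model)))

c(same_sample = same_sample, same_fit = same_fit)
```

Calling set.seed() once at the top of a script is usually enough to make a whole run repeatable; it is reset twice here only to show that the same seed reproduces the same sample.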

Putting the results of predict() in a list inside a for loop
