Question

I have the following data frame:

Index <- seq.int(1:10)
A <- c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3)
B <- c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13)
C <- c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
df <- data.frame(Index, A, B, C)
> df
      Index A  B C
 [1,]     1 5 10 7
 [2,]     2 5 11 6
 [3,]     3 3 12 7
 [4,]     4 4 12 7
 [5,]     5 3 12 6
 [6,]     6 3 11 5
 [7,]     7 2 13 6
 [8,]     8 2 13 5
 [9,]     9 4 14 5
[10,]    10 3 13 4

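I would like to fit linear models using the Index column as the independent (predictor) variable and each of the other columns, separately, as the response variable, and ultimately collect the slopes, intercepts and coefficients of determination in an easy-to-use data frame. I know I can do this by running the following line of code: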

summary(lm(cbind(A, B, C) ~ Index, data = df))

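One problem with the line above is that it uses cbind, so I have to type each column name individually. I am working with a large data frame that has many columns, and rather than using cbind I would like to tell the function once to use a whole set of columns (the response variables), for example by writing df[, 2:ncol(df)] in place of cbind(A, B, C).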

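Another problem with the line above is that the output is not in a user-friendly form. Ultimately, I would like the output (slopes, intercepts and coefficients of determination) in an easy-to-use data frame: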

response <- c("A", "B", "C")
slope <- c(-0.21818, 0.33333, -0.29091)
intercept <- c(4.60000, 10.26667, 7.40000)
r.squared <- c(0.3776, 0.7106, 0.7273)
summary_df <- data.frame(response, slope, intercept, r.squared)
> summary_df
  response    slope intercept r.squared
1        A -0.21818   4.60000    0.3776
2        B  0.33333  10.26667    0.7106
3        C -0.29091   7.40000    0.7273

What is the most efficient way to do this? There must be a solution using lapply, but I have not found it. Many thanks!

Answer 1

To address the first part of your question, you can pass matrix objects to either side of the lm formula:

summary(lm(as.matrix(df[-1]) ~ as.matrix(df[1])))

Checking against the coefficients from the cbind version:

all.equal(
  coef(lm(as.matrix(df[-1]) ~ as.matrix(df[1]))),
  coef(lm(cbind(A,B,C) ~ Index, data=df)),
  check.attributes=FALSE
)
#[1] TRUE
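For the second part of the question (a tidy data frame of slopes, intercepts and R-squared values), one possible sketch is to pull the pieces out of the same multi-response fit. The names fit, cf and summary_df are only illustrative, and the data are re-created here so the block runs on its own.

```r
# Data from the question
df <- data.frame(
  Index = 1:10,
  A = c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3),
  B = c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13),
  C = c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
)

# One multi-response fit: every column except the first is a response
fit <- lm(as.matrix(df[-1]) ~ as.matrix(df[1]))

# coef() is a 2 x 3 matrix: row 1 = intercepts, row 2 = slopes,
# one column per response
cf <- coef(fit)

summary_df <- data.frame(
  response  = colnames(cf),
  slope     = unname(cf[2, ]),
  intercept = unname(cf[1, ]),
  # summary() of a multi-response fit is a list with one summary.lm per response
  r.squared = unname(sapply(summary(fit), function(s) s$r.squared))
)
summary_df
```

Because all responses share one design matrix, this fits everything in a single call rather than looping over columns.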

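Note the warning raised by 李哲源 that a combination such as matrix(...) ~ . will not work as expected. It is usually safer to specify both sides as expressions, or both sides as matrices.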
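If you prefer to keep both sides as expressions, a rough sketch is to build the cbind(...) call from the column names instead of typing it. This assumes the column names are syntactic R names (otherwise they would need backticks); responses, f, fit_expr and fit_mat are illustrative names.

```r
# Data from the question
df <- data.frame(
  Index = 1:10,
  A = c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3),
  B = c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13),
  C = c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
)

# All columns except the first are responses
responses <- names(df)[-1]

# Builds the formula cbind(A, B, C) ~ Index without typing the names
f <- as.formula(
  paste0("cbind(", paste(responses, collapse = ", "), ") ~ Index")
)

fit_expr <- lm(f, data = df)                          # expressions on both sides
fit_mat  <- lm(as.matrix(df[-1]) ~ as.matrix(df[1]))  # matrices on both sides

# Both specifications should give the same coefficients
all.equal(coef(fit_expr), coef(fit_mat), check.attributes = FALSE)
```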

Answer 2

I would convert the data frame to a tibble. That lets you use list-columns, as described in this presentation, to store and manipulate the models.

Let's call the data frame df1 rather than df. Convert it to a tibble, then reshape it with tidyr::gather() and tidyr::nest():

library(tidyverse)
library(broom)

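df1 %>% 
  as_tibble() %>%               # as.tibble() is deprecated
  gather(Var, Val, -Index) %>%  # long format: one row per (Index, Var)
  nest(data = -Var)             # one row per Var, the rest in a list-column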

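The result is a tibble with one row for each of A, B and C, and a list-column, data, that holds for each of them a tibble with Index and the corresponding values, Val.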

# A tibble: 3 x 2
  Var   data             
  <chr> <list>           
1 A     <tibble [10 x 2]>
2 B     <tibble [10 x 2]>
3 C     <tibble [10 x 2]>
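In current tidyr releases gather() is superseded by pivot_longer(). A sketch of the same reshaping with the newer function, assuming tidyr 1.0 or later; df1 is re-created from the question's data so the block runs on its own, and nested is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Data from the question, named df1 as in this answer
df1 <- data.frame(
  Index = 1:10,
  A = c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3),
  B = c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13),
  C = c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
)

nested <- df1 %>%
  as_tibble() %>%
  # Long format: one row per (Index, Var) pair
  pivot_longer(-Index, names_to = "Var", values_to = "Val") %>%
  # One row per Var; Index and Val go into the list-column `data`
  nest(data = -Var)

# Each element of the list-column is an ordinary tibble
nested$data[[1]]
```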

Now we can use dplyr::mutate() and purrr::map to create a column that holds the models for A, B and C.

df1 %>% 
  as_tibble() %>% 
  gather(Var, Val, -Index) %>% 
  nest(data = -Var) %>% 
  mutate(model = map(data, ~ lm(Index ~ Val, data = .x)))  # one lm per row

# A tibble: 3 x 3
  Var   data              model   
  <chr> <list>            <list>  
1 A     <tibble [10 x 2]> <S3: lm>
2 B     <tibble [10 x 2]> <S3: lm>
3 C     <tibble [10 x 2]> <S3: lm>
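Each entry of the model list-column is an ordinary lm object, so the usual accessors work on it. A small sketch of pulling one model back out, with df1 re-created from the question's data; models and fit_A are illustrative names.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Data from the question, named df1 as in this answer
df1 <- data.frame(
  Index = 1:10,
  A = c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3),
  B = c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13),
  C = c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
)

models <- df1 %>%
  as_tibble() %>%
  gather(Var, Val, -Index) %>%
  nest(data = -Var) %>%
  mutate(model = map(data, ~ lm(Index ~ Val, data = .x)))

# First row corresponds to Var == "A"
fit_A <- models$model[[1]]
class(fit_A)
coef(fit_A)
```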

Finally, we can use broom::glance() or broom::tidy() to extract the desired values from the models, then tidyr::unnest() to get back to a regular tibble.

Using glance:

df1 %>% 
  as_tibble() %>% 
  gather(Var, Val, -Index) %>% 
  nest(data = -Var) %>% 
  mutate(model = map(data, ~ lm(Index ~ Val, data = .x)), 
         summary = map(model, glance)) %>%  # one-row model summary per fit
  unnest(summary) %>% 
  select(-data, -model)

# A tibble: 3 x 12
  Var   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
  <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1 A         0.378         0.300  2.53      4.85 0.0587      2  -22.4  50.7  51.6     51.3           8
2 B         0.711         0.674  1.73     19.6  0.00219     2  -18.5  43.1  44.0     23.9           8
3 C         0.727         0.693  1.68     21.3  0.00171     2  -18.2  42.5  43.4     22.5           8

Using tidy:

df1 %>% 
  as_tibble() %>% 
  gather(Var, Val, -Index) %>% 
  nest(data = -Var) %>% 
  mutate(model = map(data, ~ lm(Index ~ Val, data = .x)), 
         summary = map(model, tidy)) %>%  # one row per coefficient
  unnest(summary)

# A tibble: 6 x 6
  Var   term        estimate std.error statistic  p.value
  <chr> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 A     (Intercept)    11.4      2.79       4.08 0.00352 
2 A     Val            -1.73     0.786     -2.20 0.0587  
3 B     (Intercept)   -20.3      5.85      -3.47 0.00842 
4 B     Val             2.13     0.481      4.43 0.00219 
5 C     (Intercept)    20        3.18       6.28 0.000237
6 C     Val            -2.5      0.541     -4.62 0.00171
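Note that the models above regress Index on Val, so the slopes and intercepts differ from those requested in the question (R-squared is the same in either direction for a single predictor). A sketch of the reverse direction, with each column regressed on Index, which should reproduce the question's target table; the column names response and value and the object name summary_df are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Data from the question, named df1 as in this answer
df1 <- data.frame(
  Index = 1:10,
  A = c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3),
  B = c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13),
  C = c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
)

summary_df <- df1 %>%
  as_tibble() %>%
  gather(response, value, -Index) %>%
  nest(data = -response) %>%
  mutate(
    # Each column is the response; Index is the predictor
    model     = map(data, ~ lm(value ~ Index, data = .x)),
    slope     = map_dbl(model, ~ coef(.x)[["Index"]]),
    intercept = map_dbl(model, ~ coef(.x)[["(Intercept)"]]),
    r.squared = map_dbl(model, ~ glance(.x)$r.squared)
  ) %>%
  select(response, slope, intercept, r.squared)

summary_df
```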

Obtaining slopes, intercepts and coefficients of determination from multiple linear models, all from the same data frame
