Question

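I built a matrix whose column names are the names of the subset of regressors that I want to insert into a regression model formula in R. For example:

data$age is the response variable

X is the design matrix, whose column names are, for example, data$education and data$wage.

The problem is that the column names of X are not fixed (i.e. I don't know in advance what they will be), so I tried writing the following code: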

best_model <- lm(data$age ~ paste(colnames(x[, GA@solution == 1]), sep = "+"))

But it doesn't work.

Answer 1

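Rather than building the formula by hand, it may help to use the pipe (%>%) together with dplyr::select(). (This assumes the matrix is first converted to a data frame.)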
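As a rough sketch of how this could look with the objects from the question: the matrix X, the response age and the 0/1 vector keep below are simulated stand-ins (keep plays the role of GA@solution), so every name and value here is an assumption rather than the asker's real data.

```r
library(dplyr)

# Hypothetical stand-ins for the objects described in the question
set.seed(1)
X <- matrix(rnorm(60), nrow = 20,
            dimnames = list(NULL, c("education", "wage", "tenure")))
age  <- rnorm(20, mean = 40, sd = 10)  # simulated response
keep <- c(1, 1, 0)                     # assumed 0/1 selection vector

# Convert the matrix to a data frame and keep only the selected columns;
# all_of() selects columns whose names are stored in a character vector
df <- as.data.frame(X) %>%
  select(all_of(colnames(X)[keep == 1]))
df$age <- age

# "age ~ ." regresses age on every other column of df
best_model <- lm(age ~ ., data = df)
names(coef(best_model))
```

Because the selection happens on the data frame, the formula itself never needs to know the column names. The answer's own examples below use ready-made data frames instead.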

library(tidyverse)
mpg
#> # A tibble: 234 x 11
#>    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
#>  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
#>  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
#>  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
#>  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
#>  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
#>  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
#>  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
#>  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
#> 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
#> # ... with 224 more rows

Select

dplyr::select() subsets columns.

mpg %>% 
  select(hwy, manufacturer, displ, cyl, cty) %>% # subsetting
  lm(hwy ~ ., data = .)
#> 
#> Call:
#> lm(formula = hwy ~ ., data = .)
#> 
#> Coefficients:
#>            (Intercept)   manufacturerchevrolet       manufacturerdodge  
#>                2.65526                -1.08632                -2.55442  
#>       manufacturerford       manufacturerhonda     manufacturerhyundai  
#>               -2.29897                -2.98863                -0.94980  
#>       manufacturerjeep  manufacturerland rover     manufacturerlincoln  
#>               -3.36654                -1.87179                -1.10739  
#>    manufacturermercury      manufacturernissan     manufacturerpontiac  
#>               -2.64828                -2.44447                 0.75427  
#>     manufacturersubaru      manufacturertoyota  manufacturervolkswagen  
#>               -3.04204                -2.73963                -1.62987  
#>                  displ                     cyl                     cty  
#>               -0.03763                 0.06134                 1.33805

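Writing -col.name inside select() excludes that column. In the formula hwy ~ ., the dot stands for every other column of the data, and in data = . the dot is the %>% placeholder for the data frame coming from the left-hand side.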
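For illustration, the same model can probably be reached by excluding columns instead of listing the ones to keep. This is only a sketch; the object name fit is introduced here and is not part of the original answer.

```r
library(dplyr)

# Drop the columns that should not enter the model; what remains is
# manufacturer, displ, cyl, cty and hwy
fit <- ggplot2::mpg %>%
  select(-model, -year, -trans, -drv, -fl, -class) %>%
  lm(hwy ~ ., data = .)

names(coef(fit))
```

Excluding is convenient when most columns are wanted and only a few need to be left out.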

Tidyselect

Many datasets group related columns by giving them names that share a prefix or suffix joined with an underscore.

nycflights13::flights
#> # A tibble: 336,776 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1      517            515         2      830
#>  2  2013     1     1      533            529         4      850
#>  3  2013     1     1      542            540         2      923
#>  4  2013     1     1      544            545        -1     1004
#>  5  2013     1     1      554            600        -6      812
#>  6  2013     1     1      554            558        -4      740
#>  7  2013     1     1      555            600        -5      913
#>  8  2013     1     1      557            600        -3      709
#>  9  2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

For example, dep_delay and arr_delay are both about delay times. Select helpers such as starts_with(), ends_with() and contains() can pick out such columns.

nycflights13::flights %>% 
  select(starts_with("sched"),
         ends_with("delay"),
         distance)
#> # A tibble: 336,776 x 5
#>    sched_dep_time sched_arr_time dep_delay arr_delay distance
#>             <int>          <int>     <dbl>     <dbl>    <dbl>
#>  1            515            819         2        11     1400
#>  2            529            830         4        20     1416
#>  3            540            850         2        33     1089
#>  4            545           1022        -1       -18     1576
#>  5            600            837        -6       -25      762
#>  6            558            728        -4        12      719
#>  7            600            854        -5        19     1065
#>  8            600            723        -3       -14      229
#>  9            600            846        -3        -8      944
#> 10            600            745        -2         8      733
#> # ... with 336,766 more rows

After that, just pipe the result into lm().

nycflights13::flights %>% 
  select(starts_with("sched"),
         ends_with("delay"),
         distance) %>% 
  lm(dep_delay ~ ., data = .)
#> 
#> Call:
#> lm(formula = dep_delay ~ ., data = .)
#> 
#> Coefficients:
#>    (Intercept)  sched_dep_time  sched_arr_time       arr_delay  
#>      -0.151408        0.002737        0.000951        0.816684  
#>       distance  
#>       0.001859
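
If the formula does have to be assembled from the column names, one possible approach is sketched below. The attempt in the question fails for two reasons: paste() with sep only separates its different arguments, while joining the elements of a single vector needs collapse, and lm() expects a formula rather than a character string. The data here are simulated, and X, dat, keep and vars are hypothetical names.

```r
# Hypothetical stand-ins for the objects described in the question
set.seed(1)
X <- matrix(rnorm(60), nrow = 20,
            dimnames = list(NULL, c("education", "wage", "tenure")))
dat  <- data.frame(age = rnorm(20, mean = 40, sd = 10), X)
keep <- c(1, 0, 1)                # assumed 0/1 selection vector
vars <- colnames(X)[keep == 1]

# collapse (not sep) joins the elements of one vector into one string
paste(vars, collapse = " + ")     # "education + tenure"

# reformulate() from the stats package builds the formula object directly
f <- reformulate(vars, response = "age")
best_model <- lm(f, data = dat)
names(coef(best_model))
```

Passing data = dat also means the column names can stay plain (education rather than data$education), which keeps the generated formula readable.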
