How to create dummy variables from separate columns that share the same levels, using the values of specific columns

Asked: 2019-05-14 16:00:36

Tags: r linear-regression dummy-variable

I am trying to get dummy variables for the following table:

df1 <- structure(list(Value1 = c(9.330154398, 32.43881489, 54.77178387, 54.77178387),
                      Value2 = c(1, 2, 3, 8),
                      var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"),
                      var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"),
                      var3 = c("HomeEast", "HomeWest", "AwayEast", "AwayWest"),
                      var3values = c(1,2,3,4),
                      var4 = c("AwayWest", "AwayWest", "HomeSame", "HomeEast"),
                      var4values = c(5,6,7,8)), 
                 class = "data.frame", row.names = c(NA,-4L))

The result should look like this:

Value1       Value2  HomeEast  HomeWest  AwayEast  AwayWest  HomeSame  HomeATL  AwayHOU  AwaySDN  AwayLAN
9.330154398       1         1         0         0         5         0        1        1        0        0
32.43881489       2         0         2         0         6         0        1        0        1        0
54.77178387       3         0         0         3         0         7        1        0        0        1
54.77178387       8         8         0         0         4         0        1        0        0        1

I have asked a similar question before, and the approach I used was:

library(tidyverse)
rownames_to_column(df1, 'rn') %>%
    gather(key, val, var1:var4) %>% 
    count(rn, val) %>%
    spread(val, n, fill = 0)  %>%
    select(-rn) %>%
    bind_cols(df1[1:2], .)

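However, it returns dummy values of 1 or 0 rather than the values from the designated columns (var3values and var4values).

How can I do this?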

2 answers:

Answer 0 (score: 0)

Here is what I would do:

library(tidyverse)

# 0/1 indicators for the levels found in var1 and var2
one <- df1 %>%
  select(var1:var2) %>%
  rownames_to_column('rn') %>%
  gather(key, val, var1:var2) %>%
  mutate(key = 1) %>%
  spread(val, key, fill = 0) %>%
  select(-rn)

# levels found in var3 and var4, filled with var3values / var4values
two <- df1 %>%
  select(var3:var3values) %>%
  rownames_to_column('rn') %>%
  rename(var = var3, values = var3values) %>%
  bind_rows(df1 %>%
              select(var4:var4values) %>%
              rownames_to_column('rn') %>%
              rename(var = var4, values = var4values)) %>%
  spread(var, values, fill = 0) %>%
  select(-rn)

# the untouched Value1 and Value2 columns
three <- df1 %>% select(1, 2)

cbind(three, two, one)
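
The same long-then-wide idea can also be sketched with pivot_wider(), since the tidyr documentation marks gather() and spread() as superseded. This is an alternative under stated assumptions, not part of the answer above: the names ids, long, wide and result are made up for illustration, integer row ids are used so that ordering stays numeric, and it assumes no level occurs twice within the same row (which holds for df1).

```r
library(dplyr)
library(tidyr)

df1 <- structure(list(Value1 = c(9.330154398, 32.43881489, 54.77178387, 54.77178387),
                      Value2 = c(1, 2, 3, 8),
                      var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"),
                      var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"),
                      var3 = c("HomeEast", "HomeWest", "AwayEast", "AwayWest"),
                      var3values = c(1, 2, 3, 4),
                      var4 = c("AwayWest", "AwayWest", "HomeSame", "HomeEast"),
                      var4values = c(5, 6, 7, 8)),
                 class = "data.frame", row.names = c(NA, -4L))

ids <- seq_len(nrow(df1))  # hypothetical integer row id

# One record per (row, level): var1/var2 contribute 1,
# var3/var4 contribute their companion value columns.
long <- bind_rows(
  tibble(rn = ids, level = df1$var1, value = 1),
  tibble(rn = ids, level = df1$var2, value = 1),
  tibble(rn = ids, level = df1$var3, value = df1$var3values),
  tibble(rn = ids, level = df1$var4, value = df1$var4values)
)

# Spread the levels into columns, filling absent levels with 0.
wide <- long %>%
  pivot_wider(names_from = level, values_from = value,
              values_fill = list(value = 0)) %>%
  arrange(rn) %>%
  select(-rn)

result <- bind_cols(df1[1:2], wide)
result
```

The level columns come out in order of first appearance rather than alphabetically, so a final select() would be needed to match the column order in the question.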

Answer 1 (score: 0)

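One option is to gather the columns whose names start with 'var', followed by one or more digits (\\d+) up to the end ($) of the string, selecting them with matches. Then group by the row number and the 'val' column, and create 'n' according to the conditions given in case_when: if 'key' is 'var3', take the corresponding 'var3values'; if it is 'var4', take 'var4values'; if it is neither, take the frequency count (n()). Finally, spread the result to 'wide' format, keeping only the columns of interest.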
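
A minimal sketch of the steps described above, reconstructed from the description rather than copied from the answerer. The regular expression ^var\\d+$, the use of row_number() for the row id, the as.numeric() wrapper around n(), and the name out are assumptions.

```r
library(tidyverse)

df1 <- structure(list(Value1 = c(9.330154398, 32.43881489, 54.77178387, 54.77178387),
                      Value2 = c(1, 2, 3, 8),
                      var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"),
                      var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"),
                      var3 = c("HomeEast", "HomeWest", "AwayEast", "AwayWest"),
                      var3values = c(1, 2, 3, 4),
                      var4 = c("AwayWest", "AwayWest", "HomeSame", "HomeEast"),
                      var4values = c(5, 6, 7, 8)),
                 class = "data.frame", row.names = c(NA, -4L))

out <- df1 %>%
  mutate(rn = row_number()) %>%            # integer row id
  # var1..var4 only: the trailing $ excludes var3values / var4values
  gather(key, val, matches("^var\\d+$")) %>%
  group_by(rn, val) %>%
  mutate(n = case_when(
    key == "var3" ~ var3values,            # value paired with var3
    key == "var4" ~ var4values,            # value paired with var4
    TRUE          ~ as.numeric(n())        # otherwise the frequency count
  )) %>%
  ungroup() %>%
  select(rn, Value1, Value2, val, n) %>%   # keep only the columns of interest
  spread(val, n, fill = 0) %>%
  select(-rn)

out
```

Because spread() orders the new columns alphabetically, the level columns appear as AwayEast, AwayHOU, AwayLAN, and so on, rather than in the order shown in the question.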