How do I create a data frame with multiple categorical variables and interactions, grouped by ID?

Asked: 2019-05-22 18:39:31

Tags: sql r group-by categorical-data interaction

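I want to set up my data frame so that it has one row per ID, with a separate column for each level of my categorical variables and for each of their interactions.

This is what the original table looks like.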

+----+----------------+---------+
| ID |      Page      |  Click  |
+----+----------------+---------+
|  1 | homepage       | logo    |
|  1 | homepage       | search  |
|  1 | category page  | logo    |
|  1 | category page  | search  |
|  2 | homepage       | logo    |
|  2 | homepage       | search  |
| .. |                |         | 
+----+----------------+---------+

I want to turn it into a table like this.

+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+
| ID | Page_homepage  | Page_categorypage  | Click_logo | Click_search  | homepage:search | categorypage:search  | homepage:logo | categorypage:logo |
+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+
|  1 |              1 |                  1 |          1 |             1 |               1 |                    1 |             1 |                 1 |
|  2 |              1 |                  0 |          1 |             1 |               1 |                    0 |             1 |                 0 |
+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+

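My goal is to create features, including interactions, for a logistic regression. Each ID has an associated outcome, so it is important to me that the result is grouped by ID.

What is the best and simplest way to do this? I don't want to build every possible combination by hand. I don't mind whether it is done in R, Python, or SQL.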

2 answers:

Answer 0 (score: 1)

One approach is to handle the individual variables and the interactions separately, then combine them:

library(tidyverse)
tbl <- tibble(
  ID = c(1, 1, 1, 1, 2, 2),
  Page = c("homepage", "homepage", "categorypage", "categorypage", "homepage", "homepage"),
  Click = c("logo", "search", "logo", "search", "logo", "search")
)
tbl
#> # A tibble: 6 x 3
#>      ID Page         Click 
#>   <dbl> <chr>        <chr> 
#> 1     1 homepage     logo  
#> 2     1 homepage     search
#> 3     1 categorypage logo  
#> 4     1 categorypage search
#> 5     2 homepage     logo  
#> 6     2 homepage     search

tbl %>%
  gather(variable, value, Page, Click) %>%
  transmute(ID, colname = str_c(variable, "_", value), presence = 1) %>%
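  distinct() %>% # individual variables done; next, add the interactions
  # distinct() on the interactions too: spread() fails if an ID repeats the same Page/Click pair
  bind_rows(distinct(transmute(tbl, ID, colname = str_c(Page, ":", Click), presence = 1))) %>%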
  spread(colname, presence, fill = 0) %>%
  select(ID, matches("Page_"), matches("Click_"), matches(":"))
#> # A tibble: 2 x 9
#>      ID Page_categorypa… Page_homepage Click_logo Click_search
#>   <dbl>            <dbl>         <dbl>      <dbl>        <dbl>
#> 1     1                1             1          1            1
#> 2     2                0             1          1            1
#> # … with 4 more variables: `categorypage:logo` <dbl>,
#> #   `categorypage:search` <dbl>, `homepage:logo` <dbl>,
#> #   `homepage:search` <dbl>

Created on 2019-05-22 by the reprex package (v0.2.1)
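
The pipeline above relies on gather() and spread(), which tidyr has since superseded with pivot_longer() and pivot_wider(). The following is a minimal sketch of the same idea with the newer functions, assuming tidyr 1.0.0 or later; the intermediate names main_effects, interactions, and wide are illustrative and do not come from the answer.

```r
library(dplyr)
library(tidyr)
library(stringr)

tbl <- tibble(
  ID    = c(1, 1, 1, 1, 2, 2),
  Page  = c("homepage", "homepage", "categorypage", "categorypage", "homepage", "homepage"),
  Click = c("logo", "search", "logo", "search", "logo", "search")
)

# One row per (ID, level) for the individual variables, e.g. "Page_homepage"
main_effects <- tbl %>%
  pivot_longer(c(Page, Click), names_to = "variable", values_to = "value") %>%
  transmute(ID, colname = str_c(variable, "_", value), presence = 1) %>%
  distinct()

# One row per (ID, Page:Click) combination that actually occurs
interactions <- tbl %>%
  transmute(ID, colname = str_c(Page, ":", Click), presence = 1) %>%
  distinct()

# Spread to one row per ID; combinations an ID never produced become 0
wide <- bind_rows(main_effects, interactions) %>%
  pivot_wider(
    names_from  = colname,
    values_from = presence,
    values_fill = list(presence = 0)
  )

print(wide)
```

As in the answer, only combinations that occur somewhere in the data get a column. Columns appear in order of first occurrence, so a select() like the one in the answer can be appended if a particular column order matters.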

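Answer 1 (score: 1)

OK, here is another approach. I tried to make it assume as little as possible about the table's column names and size. The only assumptions are that the id column comes first in the table and that all remaining columns hold character values, as in your example.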


library(dplyr)
library(purrr)

df <- data.frame( id = c(1,1,2,2,2,3,3), page = c("home", "home", "home", "cat", "cat", "cat", "hat"), 
                  click = c("search", "logo", "search", "logo", "search", "banana", "banana") )

# auxiliary function for reshape
indicate <- function(x) {
  as.integer(!is_empty(x))
}

# column list for which we want to create the table
cols <- df %>% select(-id) %>% colnames()

# prefix every level with its column name (e.g. "home" becomes "page_home")
purrr::map(cols, function(colname) {
  df %>% pull(colname) %>% gsub("^", paste0(colname, "_"), .)
}) %>% bind_cols() %>% setNames(cols) %>% bind_cols(df %>% select(id), .) -> df2

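# creating indicator columns for each variable level, one dcast per variable;
# the pieces are joined on id instead of bound side by side, because the names given to
# duplicated id columns by bind_cols() differ between dplyr versions
purrr::map(cols, function(colname) {
  form.string <- paste("id ~", colname)
  reshape2::dcast(df2, as.formula(form.string), indicate)
}) %>%
  purrr::reduce(left_join, by = "id") -> result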

# creating the formula for all interactions between variables (one 0/1 column per
# observed combination of levels) and joining the result with the per-level indicators by id
formula <- paste0("id ~ ", paste(cols, collapse = "+")) %>% as.formula()
df %>% reshape2::dcast(., formula, indicate) %>%
  left_join(., result, by = "id") -> final_results

print(final_results)
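
For a quick cross-check of the reshaped table, the same 0/1 indicators can be built with base R alone. This is only a sketch under the same assumptions as the answer (an id column plus character columns). The helper presence() and the "page:click" column naming are hypothetical and not part of the answer, which names its interaction columns with an underscore.

```r
df <- data.frame(
  id    = c(1, 1, 2, 2, 2, 3, 3),
  page  = c("home", "home", "home", "cat", "cat", "cat", "hat"),
  click = c("search", "logo", "search", "logo", "search", "banana", "banana"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 0/1 integer matrix with one row per id and one column
# per distinct label, marking whether that id ever shows that label.
presence <- function(id, label) {
  counts <- table(id, label)
  matrix(
    as.integer(counts > 0),
    nrow = nrow(counts),
    dimnames = list(rownames(counts), colnames(counts))
  )
}

# Per-level indicators for each variable, then the observed page/click pairs
wide <- cbind(
  presence(df$id, paste0("page_", df$page)),
  presence(df$id, paste0("click_", df$click)),
  presence(df$id, paste(df$page, df$click, sep = ":"))
)

# check.names = FALSE keeps the ":" in the interaction column names
wide <- data.frame(id = as.numeric(rownames(wide)), wide, check.names = FALSE)
print(wide)
```

Like both answers, this only creates columns for combinations that occur in the data, so unseen page/click pairs do not appear as all-zero columns.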