Creating variables computed from the possible combinations of various other variables

Time: 2019-11-29 18:03:59

Tags: r dplyr

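I have a large database in which each row is a piece of text that has been coded with codes belonging to 4 different dimensions. I want to create new variables that output the possible combinations.

Example: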

Recipe  <-c("ndfkkjd nsakjfbk slgnjkdf", "bffhsbk sbfksdhbk, kbvkbdsk", "asbkdbask", "ouwehowq", "yeueyye fbhfbj")
Origin  <-c("Morocco", "Spain", "France", "Spain", "Italy")
Water   <-c(1,1,0,1, 1)
Oil     <-c(0,0,1,0,0)
Broth   <-c(0,0,1,1,1)
Chicken <-c(1,1,0,0,0)
df <- tibble::tibble(Recipe=Recipe, Origin=Origin, Broth=Broth, Chicken=Chicken, Oil=Oil, Water=Water)

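I want variables that show the possible combinations of water or oil with broth or chicken. Obviously, my database is much larger and the possible combinations are far more numerous (13 and 35 combinations), so I really need to do this automatically. I know that the sum of these variables cannot exceed 2 (i.e., no row contains more than two ingredients). My expected output should look like this: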

`Broth+Oil`     <- c(0,0,1,0,0)
`Broth+Water`   <- c(0,0,0,1,1)
`Chicken+Oil`   <- c(0,0,0,0,0)
`Chicken+Water` <- c(1,1,0,0,0)
df2 <- tibble::tibble(`Broth+Oil`,
              `Broth+Water`,
              `Chicken+Oil`,
              `Chicken+Water`)
df3 <- cbind(df, df2)

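So far I have only created a vector with all the possible combinations, but I really don't know how to go on from there. Any suggestions would be greatly appreciated. Thank you very much!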

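2 answers:

Answer 0: (score: 0)

I think you can get the combinations you want by pivoting the tibble longer, filtering for the entries equal to 1, combining them, and pivoting the tibble wider again.

Have a look at this: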

library(tidyverse)

# First batch of variables
cols1 <- c("Water", "Oil")

# Second batch of variables
cols2 <- c("Broth", "Chicken")

df %>% 
    # all_of() selects the columns named in the character vectors
    pivot_longer(cols = all_of(cols1), names_to = "col1", values_to = "ind_1") %>% 
    pivot_longer(cols = all_of(cols2), names_to = "col2", values_to = "ind_2") %>% 
    # keep only the pairs where both ingredients are present
    filter(ind_1 == 1 & ind_2 == 1) %>% 
    # name each combination as in the question, e.g. "Broth+Water"
    mutate(combined = paste(col2, col1, sep = "+")) %>% 
    select(Recipe, Origin, combined) %>% 
    mutate(dummy = 1) %>% 
    pivot_wider(names_from = combined,
                values_from = dummy,
                values_fill = list(dummy = 0))

Let me know if you have any questions or if I have completely missed the point!
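Note that the pipeline above returns only Recipe, Origin and the combination columns. If the original 0/1 columns are needed as well, one possible sketch is to join the wide result back onto df. This assumes that Recipe and Origin together identify a row uniquely, which holds in the example but may not in the real data; the names wide and result are only illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- tibble::tibble(
  Recipe  = c("ndfkkjd nsakjfbk slgnjkdf", "bffhsbk sbfksdhbk, kbvkbdsk",
              "asbkdbask", "ouwehowq", "yeueyye fbhfbj"),
  Origin  = c("Morocco", "Spain", "France", "Spain", "Italy"),
  Broth   = c(0, 0, 1, 1, 1),
  Chicken = c(1, 1, 0, 0, 0),
  Oil     = c(0, 0, 1, 0, 0),
  Water   = c(1, 1, 0, 1, 1)
)

cols1 <- c("Water", "Oil")
cols2 <- c("Broth", "Chicken")

# Same reshaping idea as above, stored in an intermediate table
wide <- df %>%
  pivot_longer(cols = all_of(cols1), names_to = "col1", values_to = "ind_1") %>%
  pivot_longer(cols = all_of(cols2), names_to = "col2", values_to = "ind_2") %>%
  filter(ind_1 == 1, ind_2 == 1) %>%
  mutate(combined = paste(col2, col1, sep = "+"), dummy = 1) %>%
  select(Recipe, Origin, combined, dummy) %>%
  pivot_wider(names_from = combined, values_from = dummy,
              values_fill = list(dummy = 0))

# Assumption: (Recipe, Origin) is a unique key for the rows of df
result <- left_join(df, wide, by = c("Recipe", "Origin"))
```

Rows without any matching pair are dropped by the filter step, so they would get NA rather than 0 in the joined columns. Combinations that never occur in the data (Chicken+Oil here) still do not get a column.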

Answer 1: (score: 0)

I'm a bit confused about what you are looking for, but I'll give it a try. I'll give you a few solutions; pick the one you need.

First, your df:

Recipe <- c("ndfkkjd nsakjfbk slgnjkdf", "bffhsbk sbfksdhbk, kbvkbdsk", "asbkdbask", "ouwehowq", "yeueyye fbhfbj")
Origin <- c("Morocco", "Spain", "France", "Spain", "Italy")
Water <- c(1,1,0,1, 1)
Oil <- c(0,0,1,0,0)
Broth <- c(0,0,1,1,1)
Chicken <- c(1,1,0,0,0)
df <- tibble::tibble(Recipe=Recipe, Origin=Origin, Broth=Broth, Chicken=Chicken, Oil=Oil, Water=Water)

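First solution

Let's create a column that shows the combination of the two ingredients.

We do this with good old apply, after turning your data frame into a matrix of logical values. This solution doesn't care how many columns you have, or whether each row sums to 2 or more.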

cols <- c("Broth", "Chicken", "Oil", "Water")

df$comb <- apply(df[cols] == 1, 1, function(x) paste(cols[x], collapse = "+"))

df
#> # A tibble: 5 x 7
#> Recipe                      Origin  Broth Chicken   Oil Water comb         
#> <chr>                       <chr>   <dbl>   <dbl> <dbl> <dbl> <chr>        
#> 1 ndfkkjd nsakjfbk slgnjkdf   Morocco     0       1     0     1 Chicken+Water
#> 2 bffhsbk sbfksdhbk, kbvkbdsk Spain       0       1     0     1 Chicken+Water
#> 3 asbkdbask                   France      1       0     1     0 Broth+Oil    
#> 4 ouwehowq                    Spain       1       0     0     1 Broth+Water  
#> 5 yeueyye fbhfbj              Italy       1       0     0     1 Broth+Water  
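To illustrate the claim that the row sums don't matter, here is a small sketch on made-up rows (the toy data frame is hypothetical and not part of the question's data), with zero, two and three ingredients present:

```r
cols <- c("Broth", "Chicken", "Oil", "Water")

# Hypothetical rows: no ingredient, two ingredients, three ingredients
toy <- data.frame(
  Broth   = c(0, 1, 1),
  Chicken = c(0, 0, 1),
  Oil     = c(0, 0, 0),
  Water   = c(0, 1, 1)
)

# Same apply() call as above: paste the names of the columns equal to 1
toy$comb <- apply(toy[cols] == 1, 1, function(x) paste(cols[x], collapse = "+"))

toy$comb
```

The row with no ingredient yields an empty string, and the row with three ingredients yields all three names joined by "+", so you may want to check for such rows before spreading the result into columns.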

Second solution

This gives the same result as the solution proposed by @c0rias:

library(tidyr)
cols <- c("Broth", "Chicken", "Oil", "Water")
df$comb <- apply(df[cols] == 1, 1, function(x) paste(cols[x], collapse = "+"))
df$dummy <- 1
df %>% spread(comb, dummy, fill = 0)

#> # A tibble: 5 x 9
#> Recipe                      Origin  Broth Chicken   Oil Water `Broth+Oil` `Broth+Water` `Chicken+Water`
#> <chr>                       <chr>   <dbl>   <dbl> <dbl> <dbl>       <dbl>         <dbl>           <dbl>
#> 1 asbkdbask                   France      1       0     1     0           1             0               0
#> 2 bffhsbk sbfksdhbk, kbvkbdsk Spain       0       1     0     1           0             0               1
#> 3 ndfkkjd nsakjfbk slgnjkdf   Morocco     0       1     0     1           0             0               1
#> 4 ouwehowq                    Spain       1       0     0     1           0             1               0
#> 5 yeueyye fbhfbj              Italy       1       0     0     1           0             1               0

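Third solution

To get exactly df3, however, you need something slightly different, because with the approaches so far a combination that is theoretically possible but never occurs in the data (Chicken+Oil here) cannot show up.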

library(tidyr)
library(dplyr)
library(purrr)
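cols <- c("Broth", "Chicken", "Oil", "Water")

# the two groups of variables, in the same order as in cols
col1 <- c("Broth", "Chicken")
col2 <- c("Oil", "Water")

# get the actual combinations
res <- apply(df[cols] == 1, 1, function(x) paste(cols[x], collapse = "+"))

# get all possible combinations
comb <- expand.grid(col1, col2, stringsAsFactors = FALSE) %>% pmap_chr(paste, sep = "+")

# create a factor whose levels include the combinations that never occur
df$comb <- factor(res, levels = comb)

# complete and spread
df$dummy <- 1
df %>% complete(comb) %>% spread(comb, dummy, fill = 0) %>% semi_join(df)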

#> # A tibble: 5 x 10
#> Recipe           Origin Broth Chicken   Oil Water `Broth+Oil` `Chicken+Oil` `Broth+Water` `Chicken+Water`
#> <chr>            <chr>  <dbl>   <dbl> <dbl> <dbl>       <dbl>         <dbl>         <dbl>           <dbl>
#> 1 asbkdbask        France     1       0     1     0           1             0             0               0
#> 2 bffhsbk sbfksdh~ Spain      0       1     0     1           0             0             0               1
#> 3 ndfkkjd nsakjfb~ Moroc~     0       1     0     1           0             0             0               1
#> 4 ouwehowq         Spain      1       0     0     1           0             0             1               0
#> 5 yeueyye fbhfbj   Italy      1       0     0     1           0             0             1               0

Before tidyr 1.0.0, you could simply have used this as the last line:

df %>% spread(comb, dummy, fill = 0, drop = FALSE)

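Unfortunately, the mechanics of spread have changed since then. What I showed above works, but I found that it is not really efficient for large data.

Edit: you can also implement the third solution like this: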

library(purrr)
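cols <- c("Broth", "Chicken", "Oil", "Water")

# the two groups of variables, in the same order as in cols
col1 <- c("Broth", "Chicken")
col2 <- c("Oil", "Water")

# get all possible combinations
comb <- expand.grid(col1, col2, stringsAsFactors = FALSE) %>% pmap_chr(paste, sep = "+")

# get the actual combination of each row and compare it with every possible one
# (after t(): one row per recipe, one column per element of comb)
hits <- t(apply(df[cols] == 1, 1,
                function(x) comb == paste(cols[x], collapse = "+")))

# store the 0/1 indicators as columns; a flattened vector may be rejected by recent tibble versions
df[comb] <- as.data.frame(hits * 1)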

df
#> # A tibble: 5 x 10
#> Recipe           Origin Broth Chicken   Oil Water `Broth+Oil` `Chicken+Oil` `Broth+Water` `Chicken+Water`
#> <chr>            <chr>  <dbl>   <dbl> <dbl> <dbl>       <dbl>         <dbl>         <dbl>           <dbl>
#> 1 ndfkkjd nsakjfb~ Moroc~     0       1     0     1           0             0             0               1
#> 2 bffhsbk sbfksdh~ Spain      0       1     0     1           0             0             0               1
#> 3 asbkdbask        France     1       0     1     0           1             0             0               0
#> 4 ouwehowq         Spain      1       0     0     1           0             0             1               0
#> 5 yeueyye fbhfbj   Italy      1       0     0     1           0             0             1               0
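If apply turns out to be slow on a large table, another option is to skip the string comparison and multiply the 0/1 columns pairwise: a pair is present exactly when both indicators are 1. This is only a sketch on the example data, using a plain data.frame; the names pairs, a and b are just illustrative.

```r
# Example data from the question, as a base data.frame
df <- data.frame(
  Recipe  = c("ndfkkjd nsakjfbk slgnjkdf", "bffhsbk sbfksdhbk, kbvkbdsk",
              "asbkdbask", "ouwehowq", "yeueyye fbhfbj"),
  Origin  = c("Morocco", "Spain", "France", "Spain", "Italy"),
  Broth   = c(0, 0, 1, 1, 1),
  Chicken = c(1, 1, 0, 0, 0),
  Oil     = c(0, 0, 1, 0, 0),
  Water   = c(1, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

col1 <- c("Broth", "Chicken")
col2 <- c("Oil", "Water")

# All possible pairs, one row per pair, and their column names
pairs <- expand.grid(a = col1, b = col2, stringsAsFactors = FALSE)
comb  <- paste(pairs$a, pairs$b, sep = "+")

# The product of two 0/1 columns is 1 only where both ingredients are present
df[comb] <- lapply(seq_len(nrow(pairs)),
                   function(i) df[[pairs$a[i]]] * df[[pairs$b[i]]])
```

Unlike the exact-match comparison above, a row containing more than two ingredients would get a 1 in several pair columns here, so this relies on the question's statement that the row sums never exceed 2.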