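I have a large database in which each row is a piece of text that has been coded with codes belonging to 4 different dimensions. I would like to create new variables that flag the possible combinations.
Example: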
Recipe <-c("ndfkkjd nsakjfbk slgnjkdf", "bffhsbk sbfksdhbk, kbvkbdsk", "asbkdbask", "ouwehowq", "yeueyye fbhfbj")
Origin <-c("Morocco", "Spain", "France", "Spain", "Italy")
Water <-c(1,1,0,1, 1)
Oil <-c(0,0,1,0,0)
Broth <-c(0,0,1,1,1)
Chicken <-c(1,1,0,0,0)
df <- tibble::tibble(Recipe=Recipe, Origin=Origin, Broth=Broth, Chicken=Chicken, Oil=Oil, Water=Water)
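I want variables that show the possible combinations of Water or Oil with Broth or Chicken. Obviously, my database is much larger and the possible combinations extend to (13 and 35 combinations), so I really need to do this automatically. I know that the sum of these variables cannot exceed 2 (i.e., no row contains more than two ingredients). My expected output would look like this: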
`Broth+Oil` <- c(0,0,1,0,0)
`Broth+Water` <- c(0,0,0,1,1)
`Chicken+Oil` <- c(0,0,0,0,0)
`Chicken+Water` <- c(1,1,0,0,0)
df2 <- tibble::tibble(`Broth+Oil`,
                      `Broth+Water`,
                      `Chicken+Oil`,
                      `Chicken+Water`)
df3 <- cbind(df, df2)
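So far I have only created a vector with all the possible combinations, but I really don't know how to start thinking about it. Any advice would be greatly appreciated. Thank you very much!
Answer 0 (score: 0)
I think you can get the combinations you want by pivoting the tibble longer, filtering for the entries equal to 1, combining them, and pivoting the tibble wider again.
Have a look at this: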
library(tidyverse)
# First batch of variables
cols1 <- c("Water", "Oil")
# Second batch of variables
cols2 <- c("Broth", "Chicken")
df %>%
  pivot_longer(cols = all_of(cols1), names_to = "col1", values_to = "ind_1") %>%
  pivot_longer(cols = all_of(cols2), names_to = "col2", values_to = "ind_2") %>%
  filter(ind_1 == 1 & ind_2 == 1) %>%
  # name each pair "Broth+Water" etc., matching the column names in the question
  mutate(combined = paste(col2, col1, sep = "+")) %>%
  select(Recipe, Origin, combined) %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = combined,
              values_from = dummy,
              values_fill = list(dummy = 0))
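Let me know if you have any questions or if I have missed the point entirely!
Answer 1 (score: 0)
I am a bit confused about what you are looking for, but I'll give it a try. Here are a few solutions; pick the one you need.
First, your df: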
Recipe <- c("ndfkkjd nsakjfbk slgnjkdf", "bffhsbk sbfksdhbk, kbvkbdsk", "asbkdbask", "ouwehowq", "yeueyye fbhfbj")
Origin <- c("Morocco", "Spain", "France", "Spain", "Italy")
Water <- c(1,1,0,1, 1)
Oil <- c(0,0,1,0,0)
Broth <- c(0,0,1,1,1)
Chicken <- c(1,1,0,0,0)
df <- tibble::tibble(Recipe=Recipe, Origin=Origin, Broth=Broth, Chicken=Chicken, Oil=Oil, Water=Water)
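First solution
Let's create a column that shows the combination of the two ingredients.
We do it with good old apply, after converting your data frame into a logical matrix. This solution does not care how many columns you have, nor whether a row sums to 2 or more.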
cols <- c("Broth", "Chicken", "Oil", "Water")
df$comb <- apply(df[cols] == 1, 1, function(x) paste(cols[x], collapse = "+"))
df
#> # A tibble: 5 x 7
#>   Recipe                      Origin  Broth Chicken   Oil Water comb
#>   <chr>                       <chr>   <dbl>   <dbl> <dbl> <dbl> <chr>
#> 1 ndfkkjd nsakjfbk slgnjkdf   Morocco     0       1     0     1 Chicken+Water
#> 2 bffhsbk sbfksdhbk, kbvkbdsk Spain       0       1     0     1 Chicken+Water
#> 3 asbkdbask                   France      1       0     1     0 Broth+Oil
#> 4 ouwehowq                    Spain       1       0     0     1 Broth+Water
#> 5 yeueyye fbhfbj              Italy       1       0     0     1 Broth+Water
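To make the logical-indexing step easier to follow, here is a minimal base-R sketch on a made-up indicator matrix (the object names `toy`, `is_on`, `labels` and `too_many` are only for illustration and are not part of your data). The last toy row deliberately has three ingredients, to show that the label is still built and that such rows can be flagged if the "at most two ingredients" assumption matters to you.

```r
# Toy indicator matrix (made up for illustration); the last row has three ingredients
cols <- c("Broth", "Chicken", "Oil", "Water")
toy <- matrix(c(0, 1, 0, 1,
                1, 0, 1, 0,
                1, 1, 0, 1),
              ncol = 4, byrow = TRUE, dimnames = list(NULL, cols))

# Logical matrix: TRUE where an ingredient is present
is_on <- toy == 1

# For each row, keep the names of the TRUE columns and glue them with "+"
labels <- apply(is_on, 1, function(x) paste(cols[x], collapse = "+"))
labels

# Rows that break the "at most two ingredients" assumption
too_many <- which(rowSums(is_on) > 2)
too_many
```

The order of the names inside each label follows the order of `cols`, so it is worth keeping that vector in the same order as the combination names you expect.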
Second solution
This gives the same result as the solution proposed by @c0rias:
library(tidyr)
cols <- c("Broth", "Chicken", "Oil", "Water")
df$comb <- apply(df[cols] == 1, 1, function(x) paste(cols[x], collapse = "+"))
df$dummy <- 1
df %>% spread(comb, dummy, fill = 0)
#> # A tibble: 5 x 9
#>   Recipe                      Origin  Broth Chicken   Oil Water `Broth+Oil` `Broth+Water` `Chicken+Water`
#>   <chr>                       <chr>   <dbl>   <dbl> <dbl> <dbl>       <dbl>         <dbl>           <dbl>
#> 1 asbkdbask                   France      1       0     1     0           1             0               0
#> 2 bffhsbk sbfksdhbk, kbvkbdsk Spain       0       1     0     1           0             0               1
#> 3 ndfkkjd nsakjfbk slgnjkdf   Morocco     0       1     0     1           0             0               1
#> 4 ouwehowq                    Spain       1       0     0     1           0             1               0
#> 5 yeueyye fbhfbj              Italy       1       0     0     1           0             1               0
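Third solution
To get exactly df3, however, you need something slightly different, because so far a combination that is theoretically possible but never occurs in the data cannot show up (I am talking about Chicken+Oil).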
library(tidyr)
library(dplyr)
library(purrr)
# the two groups of ingredients to be crossed
col1 <- c("Broth", "Chicken")
col2 <- c("Oil", "Water")
cols <- c(col1, col2)
# get actual combination
res <- apply(df[cols] == 1, 1, function(x) paste(cols[x], collapse = "+"))
# get all possible combination
comb <- expand.grid(col1, col2, stringsAsFactors = FALSE) %>% pmap_chr(paste, sep = "+")
# create a factor
df$comb <- factor(res, levels = comb)
# complete and spread
df$dummy <- 1
df %>% complete(comb) %>% spread(comb, dummy, fill = 0) %>% semi_join(df)
#> # A tibble: 5 x 10
#>   Recipe           Origin Broth Chicken   Oil Water `Broth+Oil` `Chicken+Oil` `Broth+Water` `Chicken+Water`
#>   <chr>            <chr>  <dbl>   <dbl> <dbl> <dbl>       <dbl>         <dbl>         <dbl>           <dbl>
#> 1 asbkdbask        France     1       0     1     0           1             0             0               0
#> 2 bffhsbk sbfksdh~ Spain      0       1     0     1           0             0             0               1
#> 3 ndfkkjd nsakjfb~ Moroc~     0       1     0     1           0             0             0               1
#> 4 ouwehowq         Spain      1       0     0     1           0             0             1               0
#> 5 yeueyye fbhfbj   Italy      1       0     0     1           0             0             1               0
Unfortunately, before tidyr 1.0.0 you could simply have used this as the last line:
df %>% spread(comb, dummy, fill = 0, drop = FALSE)
But they have since changed how spread works. What I showed above works, but I find it is not really efficient for large data.
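If it helps to see why the factor step matters, here is a small base-R sketch. The vectors `res_seen` and `comb_all` are retyped by hand from the example (an assumption for illustration, not computed from your df); the point is only that a factor with explicit levels remembers combinations that never occur.

```r
# Values retyped from the example (assumption: same order as in the answer)
res_seen <- c("Chicken+Water", "Chicken+Water", "Broth+Oil", "Broth+Water", "Broth+Water")
comb_all <- c("Broth+Oil", "Chicken+Oil", "Broth+Water", "Chicken+Water")

# Plain character vector: unseen combinations are simply absent from the table
counts_chr <- table(res_seen)
names(counts_chr)

# Factor with explicit levels: unseen combinations are kept with a count of 0
counts_fct <- table(factor(res_seen, levels = comb_all))
counts_fct
```

Here `Chicken+Oil` only appears in the second table, which is the information the complete/spread step relies on to create the all-zero column.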
Edit: you can also get the result of the third solution this way:
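library(purrr)
# the two groups of ingredients to be crossed
col1 <- c("Broth", "Chicken")
col2 <- c("Oil", "Water")
cols <- c(col1, col2)
# get all possible combination
comb <- expand.grid(col1, col2, stringsAsFactors = FALSE) %>% pmap_chr(paste, sep = "+")
# get actual combination and compare with possible comb
is_comb <- t(apply(df[cols] == 1, 1,
                   function(x) comb == paste(cols[x], collapse = "+")))
# turn the logical matrix into 0/1 columns, one per possible combination
df[comb] <- as.data.frame(is_comb * 1)
df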
#> # A tibble: 5 x 10
#>   Recipe           Origin Broth Chicken   Oil Water `Broth+Oil` `Chicken+Oil` `Broth+Water` `Chicken+Water`
#>   <chr>            <chr>  <dbl>   <dbl> <dbl> <dbl>       <dbl>         <dbl>         <dbl>           <dbl>
#> 1 ndfkkjd nsakjfb~ Moroc~     0       1     0     1           0             0             0               1
#> 2 bffhsbk sbfksdh~ Spain      0       1     0     1           0             0             0               1
#> 3 asbkdbask        France     1       0     1     0           1             0             0               0
#> 4 ouwehowq         Spain      1       0     0     1           0             0             1               0
#> 5 yeueyye fbhfbj   Italy      1       0     0     1           0             0             1               0
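Since efficiency on large data came up, here is one more possible sketch that avoids the row-wise apply. It is a different technique from the ones above: each new column is the product of two 0/1 indicator columns. It relies on your statement that a row never has more than two ingredients; with three or more, several pair columns would be 1 in the same row. The name `dat` is just a base data.frame copy of the example for illustration, and I have not timed this on real data.

```r
# Example data from the question, as a plain data.frame (Recipe/Origin left out for brevity)
dat <- data.frame(Broth   = c(0, 0, 1, 1, 1),
                  Chicken = c(1, 1, 0, 0, 0),
                  Oil     = c(0, 0, 1, 0, 0),
                  Water   = c(1, 1, 0, 1, 1))

# The two groups of ingredients to be crossed
col1 <- c("Broth", "Chicken")
col2 <- c("Oil", "Water")

# All possible pairs, one row per pair, plus the matching column names
pairs <- expand.grid(a = col1, b = col2, stringsAsFactors = FALSE)
new_names <- paste(pairs$a, pairs$b, sep = "+")

# A pair is present exactly when both indicators are 1, i.e. their product is 1
dat[new_names] <- lapply(seq_len(nrow(pairs)),
                         function(i) dat[[pairs$a[i]]] * dat[[pairs$b[i]]])
dat
```

With 13 and 35 variables this loops over the pairs rather than over the rows, and combinations that never occur (such as `Chicken+Oil` here) still get their all-zero column.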