Question

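I have a data frame, and I want to identify the cases (rows) that meet a given condition at least a certain number of times across a set of columns. In the toy example below, I want to identify the cases where "A" was chosen in two of three columns (Choice_1 to Choice_3). I don't care in which two of the three columns "A" is found. In my example, ID = 1 and ID = 4 would be identified.

This should work for any number of "A"s across any number of columns (e.g., if I wanted to identify the cases where "A" was the choice in three of the four "Choice" columns, only ID = 1 would be identified).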

ID <- 1:4
Choice_1 <- c("A", "B", "C", "D")
Choice_2 <- c("A", "D", "C", "A")
Choice_3 <- c("A", "C", "A", "A")
Choice_4 <- c("B", "B", "A", "B")

df <- data.frame(ID, Choice_1, Choice_2, Choice_3, Choice_4)

> df
ID Choice_1 Choice_2 Choice_3 Choice_4
 1        A        A        A        B
 2        B        D        C        B
 3        C        C        A        A
 4        D        A        A        B

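One roundabout way would be to convert "A" to 1 and all other values to 0, sum the Choice columns I'm interested in, and check whether the sum is equal to or above my threshold, but I feel like there must be a better way.

The way I imagine it, it would be some form of if_else statement wrapped in a mutate, so that rows matching the condition are flagged with 1 and rows that don't are flagged with 0: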

df %>% mutate(cond_matched = if_else( two of (Choice_1, Choice_2, Choice_3) == "A", 1, 0))

ID Choice_1 Choice_2 Choice_3 Choice_4 cond_matched
 1        A        A        A        B            1
 2        B        D        C        B            0
 3        C        C        A        A            0
 4        D        A        A        B            1

I hope I've simply been searching with the wrong keywords. Thank you for your help!

Answer 1

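A base R option is to create a logical matrix from the selected columns (df[2:4] == "A"), take the row-wise sum of the TRUE elements, and check whether it is greater than or equal to 2. The resulting logical vector is then coerced to binary with as.integer or with + (hacky):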

df$cond_matched <- +(rowSums(df[2:4] == "A") >= 2)
df$cond_matched
#[1] 1 0 0 1
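
To cover the "any number of "A"s across any number of columns" part of the question, the same idea could be wrapped in a small function. This is only a sketch: the name flag_rows and its arguments are hypothetical and not part of the original answer, and it assumes missing values should count as non-matches.

```r
# Hypothetical helper (not from the original answer): flag rows in which
# `value` occurs at least `min_count` times across the columns named in `cols`.
flag_rows <- function(data, cols, value, min_count) {
  # Comparing a data frame with a scalar yields a logical matrix
  hits <- data[cols] == value
  # Assumption: NA entries are treated as non-matches (na.rm = TRUE)
  as.integer(rowSums(hits, na.rm = TRUE) >= min_count)
}

# Toy data from the question
df <- data.frame(
  ID = 1:4,
  Choice_1 = c("A", "B", "C", "D"),
  Choice_2 = c("A", "D", "C", "A"),
  Choice_3 = c("A", "C", "A", "A"),
  Choice_4 = c("B", "B", "A", "B")
)

# "A" in at least two of Choice_1 to Choice_3
res_two_of_three <- flag_rows(df, c("Choice_1", "Choice_2", "Choice_3"), "A", 2)
res_two_of_three
#[1] 1 0 0 1

# "A" in at least three of the four Choice columns
res_three_of_four <- flag_rows(df, paste0("Choice_", 1:4), "A", 3)
res_three_of_four
#[1] 1 0 0 0
```

Passing the column names as a character vector keeps the selection independent of column position, so reordering the data frame does not change the result.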

Or using tidyverse (the logic is similar to the base R solution, but the syntax is not exactly the same):

library(tidyverse)
df %>% 
    mutate(cond_matched = select(., 2:4) %>%
                            map(~ .x == 'A') %>%
                            reduce(`+`) %>%
                            `>=`(2) %>% 
                            as.integer)
#   ID Choice_1 Choice_2 Choice_3 Choice_4 cond_matched
#1  1        A        A        A        B            1
#2  2        B        D        C        B            0
#3  3        C        C        A        A            0
#4  4        D        A        A        B            1

Answer 2

One possibility with dplyr and tidyr is:

library(dplyr)
library(tidyr)

df %>%
 gather(var, val, -c(ID, Choice_4)) %>%
 group_by(ID) %>%
 summarise(cond_matched = as.integer(sum(val == "A") >= 2)) %>%
 ungroup() %>%
 left_join(df, by = c("ID" = "ID"))

     ID cond_matched Choice_1 Choice_2 Choice_3 Choice_4
  <int>        <int> <chr>    <chr>    <chr>    <chr>   
1     1            1 A        A        A        B       
2     2            0 B        D        C        B       
3     3            0 C        C        A        A       
4     4            1 D        A        A        B

Or using dplyr only (with basically the same logic as @akrun):

df %>%
 mutate(cond_matched = as.integer(rowSums(.[-ncol(.)] == "A") >= 2))

To name the columns explicitly:

df %>%
 mutate(cond_matched = as.integer(rowSums(.[grepl("Choice_1|Choice_2|Choice_3", colnames(.))] == "A") >= 2))
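
Note that the unanchored pattern "Choice_1|Choice_2|Choice_3" would also match a column such as Choice_10 if one existed. As a possible alternative, and only as a sketch that assumes dplyr 1.0.0 or later (for across()), the columns can be passed as a character vector; the variable names cols and out are illustrative and not from the original answer.

```r
library(dplyr)

# Toy data from the question
df <- data.frame(
  ID = 1:4,
  Choice_1 = c("A", "B", "C", "D"),
  Choice_2 = c("A", "D", "C", "A"),
  Choice_3 = c("A", "C", "A", "A"),
  Choice_4 = c("B", "B", "A", "B")
)

# Illustrative name: the columns to check, selected by exact name
cols <- c("Choice_1", "Choice_2", "Choice_3")

out <- df %>%
  mutate(
    # across() returns one logical column per selected column;
    # rowSums() then counts the TRUE values in each row
    cond_matched = as.integer(rowSums(across(all_of(cols), ~ .x == "A")) >= 2)
  )

out$cond_matched
#[1] 1 0 0 1
```

Because all_of() matches names exactly, adding further Choice columns later does not silently change which columns are counted.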

R function to identify cases where a condition is met x times across any of n columns?
