Question

I have a large number of data files with the following basic shape:

library(dplyr)  # provides tibble() and %>%

userID <- c(rep(10001, 3), rep(10002, 3), rep(10003, 3))
theValue <- c(NA, "foo", NA, "foo", "bar", NA, "foo", "bar", "foo_and_bar")

(rawData <- tibble(userID, theValue))

# A tibble: 9 x 2
  userID theValue   
   <dbl> <chr>      
1  10001 NA         
2  10001 foo        
3  10001 NA         
4  10002 foo        
5  10002 bar        
6  10002 NA         
7  10003 foo        
8  10003 bar        
9  10003 foo_and_bar

My goal is to list the distinct non-NA values associated with each user ID:

(df <- rawData %>%
  filter(!is.na(theValue)) %>%
  group_by(userID) %>%
  distinct(theValue))

   theValue    userID
  <chr>        <dbl>
1 foo          10001
2 foo          10002
3 bar          10002
4 foo          10003
5 bar          10003
6 foo_and_bar  10003

I will also be asked to slice these results by particular user IDs...

df[df$userID == 10001, ]

 theValue userID
  <chr>     <dbl>
1 foo       10001

...or perhaps to treat userID as a factor:

df$userID <- as.factor(df$userID)

Here is the problem: across my many files, the first column is not always called "userID". It might be called "userID-A", "userID_1", or "SoylentGreen"....

I can make most of the code dynamic:

theID <- "userID"
IDsymbol <- as.symbol(theID)

df2 <- rawData %>%
  filter(!is.na(theValue)) %>%
  group_by(!!IDsymbol) %>%
  distinct(theValue)

identical(df2, df)
[1] TRUE

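But I can't figure out how to do the slicing or the factor assignment. I have looked at a few "Programming with dplyr" sites, but I'm not sure the solutions listed there apply to my situation. Here is some example code I have tried...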

df2[theID == 10001, ]
df2[!!IDsymbol == 10001, ]
df2$!!IDsymbol <- as.factor(df2$!!IDsymbol)

...but they all return errors or empty data sets. Can anyone tell me what I am doing wrong?

Answer 1

Here is one approach using group_by_at, which takes strings as input, together with filter_at:

library(dplyr)
rawData %>% 
   filter(complete.cases(theValue)) %>%
   group_by_at(theID) %>% 
   distinct(theValue) %>% 
   filter_at(vars(theID), any_vars(. == 10001))
# A tibble: 1 x 2
# Groups:   userID [1]
#  theValue userID
#  <chr>     <dbl>
#1 foo       10001
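
As a side note, the scoped verbs (group_by_at, filter_at) are superseded as of dplyr 1.0.0 in favor of across() and, from dplyr 1.0.4, if_any(). A rough equivalent of the pipeline above might look like the sketch below. It assumes a reasonably recent dplyr and rebuilds rawData and theID from the question so that it runs on its own; res is just an illustrative name.

```r
library(dplyr)

# Sample data and column name taken from the question
rawData <- tibble(
  userID   = c(rep(10001, 3), rep(10002, 3), rep(10003, 3)),
  theValue = c(NA, "foo", NA, "foo", "bar", NA, "foo", "bar", "foo_and_bar")
)
theID <- "userID"

res <- rawData %>%
  filter(!is.na(theValue)) %>%
  # all_of() selects the column whose name is stored in the string theID
  group_by(across(all_of(theID))) %>%
  distinct(theValue) %>%
  # keep rows where the selected ID column equals 10001
  filter(if_any(all_of(theID), ~ .x == 10001))

res
```

Because only one column is selected, if_any() and if_all() behave the same here.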

Or by converting the string to a symbol (sym) and unquoting it (!!):

rawData %>%
     filter(complete.cases(theValue)) %>%
     group_by(!! rlang::sym(theID)) %>% 
     distinct(theValue) %>% 
     filter(!! rlang::sym(theID) == 10001)
# A tibble: 1 x 2
# Groups:   userID [1]
# theValue userID
#  <chr>     <dbl>
#1 foo       10001
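
Another option that avoids sym() and !! is the .data pronoun, which accepts a string via [[ inside dplyr verbs. The sketch below also covers the factor conversion from the question. It assumes dplyr >= 1.0.0, and sliced and asFactor are illustrative names only.

```r
library(dplyr)

# Sample data and column name taken from the question
rawData <- tibble(
  userID   = c(rep(10001, 3), rep(10002, 3), rep(10003, 3)),
  theValue = c(NA, "foo", NA, "foo", "bar", NA, "foo", "bar", "foo_and_bar")
)
theID <- "userID"

df2 <- rawData %>%
  filter(!is.na(theValue)) %>%
  group_by(.data[[theID]]) %>%   # look the column up by its name
  distinct(theValue)

# Slice by a particular ID
sliced <- df2 %>%
  filter(.data[[theID]] == 10001)

# Convert the ID column to a factor; ungroup first, because a grouping
# column cannot be modified while the data is still grouped by it
asFactor <- df2 %>%
  ungroup() %>%
  mutate(across(all_of(theID), as.factor))
```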

The problem in the OP's code is that it tries to apply tidyverse methods (such as !!) outside the tidyverse environment, i.e. in base R.
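
To spell out why the attempts fail: in df2[theID == 10001, ], the string "userID" itself is compared with 10001, which gives a single FALSE and therefore selects no rows. Outside functions that support tidy evaluation, !! is just a double logical negation, and df2$!!IDsymbol is not valid R syntax. If base R subsetting is preferred, [[ accepts a column name stored in a string. A minimal sketch, using a plain data frame that mirrors the df output shown in the question (sliced is an illustrative name):

```r
# Plain data frame with the same rows as the df printed in the question
df2 <- data.frame(
  userID   = c(10001, 10002, 10002, 10003, 10003, 10003),
  theValue = c("foo", "foo", "bar", "foo", "bar", "foo_and_bar")
)
theID <- "userID"

# Slice: df2[[theID]] extracts the column named by the string in theID
sliced <- df2[df2[[theID]] == 10001, ]

# Factor assignment: [[<- also works with a string column name
df2[[theID]] <- as.factor(df2[[theID]])
```

The same [[ indexing works on a tibble as well.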

Slicing data by a dynamic variable name in dplyr
