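Filter IDs based on a specified difference in values

Time: 2019-05-26 20:31:07

Tags: r filter dplyr difference

I am trying to filter IDs based on a specified condition. For example, I want to filter IDs that show a particular difference between their pre-treatment and post-treatment questionnaire scores. The aim is to obtain the IDs whose scores improved, stayed the same, or worsened. Here is a mock dataset of what I am trying to achieve:-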

    ID<-c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
    Condition<-c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
    Score<-c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11)
    df<-cbind(ID,Condition,Score)
    df<-as.data.frame(df)
    df$Condition<-as.factor(df$Condition)

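The main problem here is that some IDs appear twice, each time with both a Pre and a Post score.

I tried a dplyr solution: select the appropriate columns from the main data frame, then use the tidyverse spread function to convert to wide format, since the difference is easy to compute there. However, I ran into a particular problem. It does not work because there are duplicate instances where an ID appears in the data again (e.g. IDs aaa, bbb and ccc).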

     df2<-df%>%
     group_by(ID)%>%
     spread(Condition, Score)

This gives me the following error message:-

  

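Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 12 rows: * 10, 22 * 11, 23 * 12, 24 * 1, 19 * 2, 20 * 3, 21 Do you need to create unique ID with tibble::rowid_to_column()?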
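The error means that several rows share the same ID/Condition pair, so spread cannot decide which score belongs in which cell. A minimal base R sketch for listing those pairs, assuming the ID and Condition vectors from the question (the names key_counts and dup_keys are made up for illustration):

```r
# Same ID and Condition vectors as in the question
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c(rep("Pre", 9), rep("Post", 9), rep("Pre", 3), rep("Post", 3))

# Count how often each ID/Condition pair occurs
key_counts <- as.data.frame(table(ID = ID, Condition = Condition),
                            stringsAsFactors = FALSE)

# Pairs that occur more than once are the ones spread() cannot place
dup_keys <- key_counts[key_counts$Freq > 1, ]
dup_keys
```

Any pair listed in dup_keys needs an extra identifier (for example an occurrence counter) before the data can be reshaped to wide format.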

Ideally, the result I want looks like this:-

    #improved
    ID      Pre       Post     Difference
    aaa      23        17           -6
    bbb      20        17           -3
    ggg      20        14           -6
    hhh      19        15           -4
    iii      18        10           -8
    aaa      23        20           -3
    bbb      23        18           -5
    ccc      21        11           -10


    #no improvement
    ID      Pre       Post      Difference
    ccc      19         19          0
    eee      22         22          0
    fff      22         22          0


    #worsened
    ID      Pre       Post      Difference
    ddd      15         20          +5

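Or something similar, as long as it allows me to include the repeated IDs. Ideally, I would also like to filter further based on the size of the difference, for example to subset the IDs whose scores improved by more than 5 points or worsened by more than 5 points. Bear in mind that my actual dataset will have many more IDs than the example I have just made up and provided. As always, any help would be much appreciated.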
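The filtering by size of difference could look roughly like the base R sketch below. It assumes the mock data from the question, pairs each Pre row with the Post row of the same ID and occurrence, and uses a hypothetical cut-off named threshold. The names rn, wide, big_drop and big_rise are illustrative, and whether "more than 5" should include exactly 5 is a choice left to the reader (here it does).

```r
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c(rep("Pre", 9), rep("Post", 9), rep("Pre", 3), rep("Post", 3))
Score <- c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,
           23,23,21,20,18,11)

df <- data.frame(ID, Condition, Score, stringsAsFactors = FALSE)

# Occurrence counter within each ID/Condition pair (1 = first, 2 = second)
df$rn <- ave(seq_len(nrow(df)), df$ID, df$Condition, FUN = seq_along)

# Pair each Pre row with the Post row of the same ID and occurrence
pre  <- df[df$Condition == "Pre",  c("ID", "rn", "Score")]
post <- df[df$Condition == "Post", c("ID", "rn", "Score")]
wide <- merge(pre, post, by = c("ID", "rn"), suffixes = c(".Pre", ".Post"))
wide$Difference <- wide$Score.Post - wide$Score.Pre

threshold <- 5  # hypothetical cut-off, not part of the question's data

# Scores that dropped by at least `threshold` points ("improved" in the question)
big_drop <- wide[wide$Difference <= -threshold, ]
# Scores that rose by at least `threshold` points ("worsened" in the question)
big_rise <- wide[wide$Difference >= threshold, ]
```

Replacing <= and >= with < and > gives the strict "more than 5 points" reading.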

Thanks in advance :)

3 answers:

Answer 0: (score: 2)

One option is to first convert 'Score' from factor to numeric, then, grouped by 'ID' and 'Condition', create a sequence column ('rn'), spread to 'wide' format, take the difference between the 'Post' and 'Pre' scores, and split by the sign of the 'Difference' column to create a list of tibbles.

library(tidyverse)
df %>%
 mutate(Score = as.numeric(as.character(Score))) %>%
 group_by(ID, Condition) %>%
 mutate(rn = row_number()) %>%
 spread(Condition, Score) %>%
 mutate(Difference = Post - Pre) %>%
 ungroup %>%
 select(-rn) %>%
 group_split(grp = sign(Difference), keep = FALSE)
#[[1]]
# A tibble: 8 x 4
#  ID     Post   Pre Difference
#  <fct> <dbl> <dbl>      <dbl>
#1 aaa      17    23         -6
#2 aaa      20    23         -3
#3 bbb      17    20         -3
#4 bbb      18    23         -5
#5 ccc      11    21        -10
#6 ggg      14    20         -6
#7 hhh      15    19         -4
#8 iii      10    18         -8

#[[2]]
# A tibble: 3 x 4
#  ID     Post   Pre Difference
#  <fct> <dbl> <dbl>      <dbl>
#1 ccc      19    19          0
#2 eee      22    22          0
#3 fff      22    22          0

#[[3]]
# A tibble: 1 x 4
#  ID     Post   Pre Difference
#  <fct> <dbl> <dbl>      <dbl>
#1 ddd      20    15          5

Note: it is better not to use as.data.frame(cbind(...)), because cbind converts to a matrix, and a matrix can hold only a single class. That is, if there is a character column, all the other columns are converted to character, and wrapping the result with as.data.frame then turns them into factors (the default option at the time was stringsAsFactors = TRUE).
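The coercion described in the note can be checked directly. The sketch below uses a three-row subset of the question's data; the names m, bad and good are illustrative only.

```r
ID <- c("aaa", "bbb", "ccc")
Score <- c(23, 20, 19)

# cbind() on vectors builds a matrix, which can hold only one type
m <- cbind(ID, Score)
typeof(m)            # the numeric scores have been coerced to strings

# Wrapping the matrix in as.data.frame() does not bring the numbers back
bad <- as.data.frame(m)
class(bad$Score)     # factor on R < 4.0, character on R >= 4.0

# data.frame() keeps each column's own type
good <- data.frame(ID, Score)
class(good$Score)
mean(good$Score)     # arithmetic works without any conversion
```

Building the data frame with data.frame(ID, Condition, Score) from the start avoids the as.numeric(as.character(Score)) step entirely.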

Answer 1: (score: 2)

Another tidyverse possibility is:

df %>%
 mutate_if(is.factor, as.character) %>%
 mutate(Score = as.numeric(Score)) %>%
 group_by(Condition) %>%
 mutate(ID = make.unique(ID)) %>%
 group_by(ID) %>%
 mutate(Difference = Score - lag(Score)) %>%
 spread(Condition, Score) %>%
 summarise_all(max, na.rm = TRUE) %>%
 arrange(Difference)

   ID    Difference  Post   Pre
   <chr>      <dbl> <dbl> <dbl>
 1 ccc.1        -10    11    21
 2 iii           -8    10    18
 3 aaa           -6    17    23
 4 ggg           -6    14    20
 5 bbb.1         -5    18    23
 6 hhh           -4    15    19
 7 aaa.1         -3    20    23
 8 bbb           -3    17    20
 9 ccc            0    19    19
10 eee            0    22    22
11 fff            0    22    22
12 ddd            5    20    15

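Here, it first creates unique IDs. Second, it calculates the difference. Finally, it converts the data to wide format and arranges it by the difference.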
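The unique-ID step relies on base R's make.unique, which appends a numeric suffix to repeated strings. A small sketch on a made-up vector (ids and unique_ids are illustrative names):

```r
# IDs as they might appear within one Condition, with three repeats
ids <- c("aaa", "bbb", "ccc", "ddd", "aaa", "bbb", "ccc")

# The first occurrence keeps its name; later ones get .1, .2, ...
unique_ids <- make.unique(ids)
unique_ids
# "aaa" "bbb" "ccc" "ddd" "aaa.1" "bbb.1" "ccc.1"
```

Because the answer applies make.unique within each Condition group, the second Pre and the second Post row of an ID both receive the same suffixed label, provided the repeats occur in the same order in both conditions.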

If for some reason you need to split it by the difference, you can add the last line of @akrun's code:

df %>%
 mutate_if(is.factor, as.character) %>%
 mutate(Score = as.numeric(Score)) %>%
 group_by(Condition) %>%
 mutate(ID = make.unique(ID)) %>%
 group_by(ID) %>%
 mutate(Difference = Score - lag(Score)) %>%
 spread(Condition, Score) %>%
 summarise_all(max, na.rm = TRUE) %>%
 group_split(sign(Difference), keep = FALSE)

[[1]]
# A tibble: 8 x 4
  ID    Difference  Post   Pre
  <chr>      <dbl> <dbl> <dbl>
1 aaa           -6    17    23
2 aaa.1         -3    20    23
3 bbb           -3    17    20
4 bbb.1         -5    18    23
5 ccc.1        -10    11    21
6 ggg           -6    14    20
7 hhh           -4    15    19
8 iii           -8    10    18

[[2]]
# A tibble: 3 x 4
  ID    Difference  Post   Pre
  <chr>      <dbl> <dbl> <dbl>
1 ccc            0    19    19
2 eee            0    22    22
3 fff            0    22    22

[[3]]
# A tibble: 1 x 4
  ID    Difference  Post   Pre
  <chr>      <dbl> <dbl> <dbl>
1 ddd            5    20    15

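Answer 2: (score: 1)

The other answers address Score being a factor because of the cbind() call. Here are solutions in base R, data.table, and dplyr.

All of the solutions handle the duplicates by adding an additional ID variable, Group. This allows spread to succeed.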

# Base R ------------------------------------------------------------------

df <- data.frame(ID, Condition, Score)
df$Group <- ave(seq_len(nrow(df)), df$Condition, FUN = seq_along)

df_wide <- reshape(df, timevar = 'Condition', idvar = c('ID', 'Group'), direction = 'wide')
df_wide$Difference <- df_wide$Score.Post - df_wide$Score.Pre
df_wide[order(df_wide$Difference),]

# data.table --------------------------------------------------------------
library(data.table)

dt <- data.table(ID, Condition, Score)
dt[, Group := seq_len(.N), by = Condition]

dt_wide <- dcast(dt, ID + Group ~ Condition, value.var = 'Score')
dt_wide[, Difference := Post - Pre]
dt_wide[order(Difference),]

# dplyr -------------------------------------------------------------------
library(tidyverse)

tib <- tibble(ID, Condition, Score)

tib%>%
  group_by(Condition)%>%
  mutate(Group = row_number())%>%
  ungroup()%>%
  spread(key = 'Condition', value = 'Score')%>%
  mutate(Difference = Post - Pre)%>%
  arrange(Difference)

For this very small dataset, base R is the fastest and data.table is the slowest.

Unit: milliseconds
           expr    min      lq     mean  median      uq     max neval
     base_r_way 2.7562 2.98075 3.103155 3.05140 3.12810  6.0653   100
 data.table_way 6.6137 7.09705 8.216043 7.44250 8.01885 47.9138   100
      dplyr_way 4.7334 5.15005 5.350857 5.25085 5.40395  9.5594   100

And the data:

ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score <- as.integer(c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11))