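Filter or ifelse across multiple columns

Time: 2018-07-12 07:25:52

Tags: r if-statement filter multiple-columns selection

I am studying the communication paths patients follow when they are ill. For example: a person gets sick and goes to the doctor (A), then to the hospital (B), then contacts the insurer (C), and so on. The order differs per patient: one patient goes straight to the hospital, while another checks with the insurer first. We follow patients through the whole process, and after each contact with another institution we ask them to fill in another survey. So after every institution ("step") we obtain a survey score. This gives me the following dataset layout (in reality the dataset is very large):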

Patient<-c(1,1,1,1,1,1,1,2,2,2,2)
sample6<-c("A","A","A","A","A","A","A","A","A","A","A")
sample5<-c("Stop","B","B","B","B","B","B","Stop","C","C","C")
sample4<-c(NA,"Stop","C","C","C","C","C",NA, "Stop","F","F")
sample3<-c(NA,NA,"Stop","D","D","D","D",NA, NA,"Stop","G")
sample2<-c(NA,NA,NA,"Stop","E","E","E",NA, NA,NA,"Stop")
sample1<-c(NA,NA,NA,NA, "Stop","F","F",NA,NA,NA, NA)
sample0<-c(NA,NA,NA,NA, NA,"Stop","G",NA,NA,NA, NA)
sample00<-c(NA,NA,NA,NA, NA,NA,"Stop",NA,NA,NA, NA)
Score<-c(90,88,65,44,78,98,66,38,93,88,80)
Time<-c("01-01-2018", "02-01-2018", "03-01-2018", "04-01-2018", "05-01-2018", "06-01-2018", "07-01-2018","01-02-2018", "02-02-2018", "05-02-2018", "06-02-2018")

df<-data.frame("Patient"=Patient, "step0"=sample6, "step1"=sample5, "step2"=sample4, "step3"=sample3, "step4"=sample2, 
               "step5"=sample1,"step6"= sample0, "step7"=sample00, "Score"=Score, "Time"=Time)

> df
   Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time
1        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018
2        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018
3        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018
4        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018
5        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018
6        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018
7        1     A     B     C     D     E     F     G  Stop    66 07-01-2018
8        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018
9        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018
10       2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018
11       2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018

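So, for example: row 1 holds the survey score after institution A, row 2 belongs to the same patient and holds the survey score after institution B, and so on. Now I want to compare rows that end in the same final step. I will use "F" as the example, but for other analyses it could just as well be "C". So I want to select every row in which "F" is the final institution, plus the row directly before it, so that the two can be compared.

So I would like to create this dataset: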

   Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time Indicator
1        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018         0
2        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018         0
3        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018         0
4        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018         0
5        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018         Before
6        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018         After
7        1     A     B     C     D     E     F     G  Stop    66 07-01-2018         0
8        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018         0
9        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018         Before
10       2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018         After
11       2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018         0

I did manage to flag the rows that contain "F", plus the row before each of them:

ProcessColumns <- 2:9
d <- df[, ProcessColumns] == "F"
# 1 for every row that contains "F" anywhere, not only as the final step
df$Indicator <- rowSums(d, na.rm = TRUE)
hits <- which(df$Indicator == 1)
df$Indicator[hits - 1] <- "Before"
df$Indicator[hits] <- "After"

But this flags every row that contains "F", not only the rows where "F" is the final step. Can anyone help me?

3 answers:

Answer 0: (score: 2)

We can do

library(dplyr)
df %>% mutate(sum = rowSums(!is.na(.[2:9]))) %>%  # filled step columns per row
  group_by(Patient) %>%
  mutate(max = sum - max(sum),  # offset from the patient's longest row
         Indicator = case_when(max == -2 ~ "Before", max == -1 ~ "After", TRUE ~ as.character(0)))

# A tibble: 11 x 14
# Groups:   Patient [2]
     Patient step0 step1 step2 step3 step4 step5 step6 step7 Score Time         sum   max Ind   
     <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <fct>      <dbl> <dbl> <chr> 
 1    1.00 A     Stop  NA    NA    NA    NA    NA    NA     90.0 01-01-2018  2.00 -6.00 0     
 2    1.00 A     B     Stop  NA    NA    NA    NA    NA     88.0 02-01-2018  3.00 -5.00 0     
 3    1.00 A     B     C     Stop  NA    NA    NA    NA     65.0 03-01-2018  4.00 -4.00 0     
 4    1.00 A     B     C     D     Stop  NA    NA    NA     44.0 04-01-2018  5.00 -3.00 0     
 5    1.00 A     B     C     D     E     Stop  NA    NA     78.0 05-01-2018  6.00 -2.00 Before
 6    1.00 A     B     C     D     E     F     Stop  NA     98.0 06-01-2018  7.00 -1.00 After 
 7    1.00 A     B     C     D     E     F     G     Stop   66.0 07-01-2018  8.00  0    0     
 8    2.00 A     Stop  NA    NA    NA    NA    NA    NA     38.0 01-02-2018  2.00 -3.00 0     
 9    2.00 A     C     Stop  NA    NA    NA    NA    NA     93.0 02-02-2018  3.00 -2.00 Before
10    2.00 A     C     F     Stop  NA    NA    NA    NA     88.0 05-02-2018  4.00 -1.00 After 
11    2.00 A     C     F     G     Stop  NA    NA    NA     80.0 06-02-2018  5.00  0    0 

Update: inspired by @Andre Elrico

library(dplyr); library(tidyr); library(stringr)
df %>% unite(All, matches("step"), sep = "", remove = FALSE) %>%
  mutate(Ind = str_detect(All, "FStop"),  # TRUE where "F" is the last step before "Stop"
         Indicator = case_when(lead(Ind) == TRUE ~ "Before", Ind == TRUE ~ "After", TRUE ~ as.character(0))) %>%
  select(-All, -Ind)
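The first snippet above keys on a row's position relative to each patient's longest path (offsets -2 and -1), which lines up with "F" in the example only because "F" is the second-to-last step for both patients. A minimal base R sketch that flags by value instead is shown below. The helper names `last_step` and `flag_final` and the extra `Final` column are hypothetical, and the sketch assumes the rows are already ordered by patient and time, as in the question.

```r
# Example data from the question (Score and Time omitted for brevity).
df <- data.frame(
  Patient = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  step0 = "A",
  step1 = c("Stop", "B", "B", "B", "B", "B", "B", "Stop", "C", "C", "C"),
  step2 = c(NA, "Stop", "C", "C", "C", "C", "C", NA, "Stop", "F", "F"),
  step3 = c(NA, NA, "Stop", "D", "D", "D", "D", NA, NA, "Stop", "G"),
  step4 = c(NA, NA, NA, "Stop", "E", "E", "E", NA, NA, NA, "Stop"),
  step5 = c(NA, NA, NA, NA, "Stop", "F", "F", NA, NA, NA, NA),
  step6 = c(NA, NA, NA, NA, NA, "Stop", "G", NA, NA, NA, NA),
  step7 = c(NA, NA, NA, NA, NA, NA, "Stop", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the value sitting directly before the first "Stop" in a row.
last_step <- function(row) {
  stop_at <- match("Stop", row)
  if (is.na(stop_at) || stop_at < 2) NA_character_ else row[[stop_at - 1]]
}

# Hypothetical helper: "After" where the final step equals `target`,
# "Before" on the preceding row of the same patient, "0" elsewhere.
flag_final <- function(df, target) {
  n <- nrow(df)
  after <- !is.na(df$Final) & df$Final == target
  same_patient_next <- c(df$Patient[-1] == df$Patient[-n], FALSE)
  before <- c(after[-1], FALSE) & same_patient_next
  ind <- rep("0", n)
  ind[after] <- "After"
  ind[before] <- "Before"
  ind
}

step_cols <- grep("^step", names(df))
df$Final <- apply(df[step_cols], 1, last_step)
df$Indicator <- flag_final(df, "F")
```

Calling `flag_final(df, "C")` instead would flag the rows where "C" is the final step, without touching the rest of the code.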

Answer 1: (score: 1)

Or you can:

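library(dplyr)

# TRUE for rows whose pasted values contain "FStop", i.e. "F" is the last step
After_IND <- df %>% apply(., 1, paste, collapse = "") %>% grepl("FStop", .)
# Shift up by one row to mark the row just before each hit
Before_IND <- lead(After_IND, 1, FALSE)

df$Indicator <- 0
df$Indicator[After_IND] <- "After"
df$Indicator[Before_IND] <- "Before"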

#  Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time Indicator
#        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018         0
#        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018         0
#        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018         0
#        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018         0
#        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018    Before
#        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018     After
#        1     A     B     C     D     E     F     G  Stop    66 07-01-2018         0
#        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018         0
#        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018    Before
#        2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018     After
#        2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018         0

Note:

If you want to compare B instead, for example, you have to change the pattern:

... %>% grepl("BStop",.)
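Since only the pattern changes between analyses, the idea can be wrapped in a function that takes the target step as an argument. The following is a rough base R sketch under a few assumptions: the function name `flag_by_pattern` and the toy data are hypothetical, only the `step*` columns are pasted (so `Score` or `Time` cannot match by accident), and a `>` separator is used so that a longer step name such as "AF" is not mistaken for "F". Rows are assumed to be ordered by patient and time.

```r
# Hypothetical helper: flag rows whose path ends in `target` followed by "Stop".
flag_by_pattern <- function(df, target, step_cols = grep("^step", names(df))) {
  # Join the non-missing steps of each row, e.g. "A>F>Stop".
  path <- apply(df[step_cols], 1, function(r) paste(r[!is.na(r)], collapse = ">"))
  # Literal suffix match; the leading ">" anchors the start of the step name.
  after <- endsWith(paste0(">", path), paste0(">", target, ">Stop"))
  n <- nrow(df)
  same_patient_next <- c(df$Patient[-1] == df$Patient[-n], FALSE)
  before <- c(after[-1], FALSE) & same_patient_next
  ind <- rep("0", n)
  ind[after] <- "After"
  ind[before] <- "Before"
  ind
}

# Toy data (not from the question) with a multi-character step name "AF".
toy <- data.frame(
  Patient = c(1, 1, 1, 2, 2),
  step0 = c("A", "A", "A", "A", "A"),
  step1 = c("Stop", "F", "F", "Stop", "AF"),
  step2 = c(NA, "Stop", "G", NA, "Stop"),
  step3 = c(NA, NA, "Stop", NA, NA),
  stringsAsFactors = FALSE
)

toy$Indicator <- flag_by_pattern(toy, "F")
```

Unlike the shifted vector above, this version also checks that the "Before" row belongs to the same patient as the "After" row.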

Answer 2: (score: 0)

A tidyverse approach with quite a few lines, but it should work in general.

library(tidyverse)
df %>%
  rownames_to_column() %>% 
  gather(k,v,-Patient,-rowname,-Score, -Time) %>% 
  group_by(rowname) %>% 
  mutate(Indicator=ifelse(any(v %in%"F" ),"After",NA)) %>% 
  spread(k,v)  %>% 
  arrange(as.numeric(rowname)) %>% 
  group_by(Patient) %>% 
  mutate(Indicator=ifelse(duplicated(Indicator), NA, Indicator)) %>% 
  mutate(Indicator2=ifelse(lead(Indicator) == "After", "Before", NA)) %>% 
  mutate(Indicator=ifelse(!is.na(Indicator2), Indicator2, Indicator)) %>% 
  select(Patient, starts_with("step"), Score, Time,Indicator, -Indicator2,-rowname) %>% 
  ungroup()
# A tibble: 11 x 12
   Patient step0 step1 step2 step3 step4 step5 step6 step7 Score Time       Indicator
     <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <fct>      <chr>    
 1       1 A     Stop  NA    NA    NA    NA    NA    NA       90 01-01-2018 NA       
 2       1 A     B     Stop  NA    NA    NA    NA    NA       88 02-01-2018 NA       
 3       1 A     B     C     Stop  NA    NA    NA    NA       65 03-01-2018 NA       
 4       1 A     B     C     D     Stop  NA    NA    NA       44 04-01-2018 NA       
 5       1 A     B     C     D     E     Stop  NA    NA       78 05-01-2018 Before   
 6       1 A     B     C     D     E     F     Stop  NA       98 06-01-2018 After    
 7       1 A     B     C     D     E     F     G     Stop     66 07-01-2018 NA       
 8       2 A     Stop  NA    NA    NA    NA    NA    NA       38 01-02-2018 NA       
 9       2 A     C     Stop  NA    NA    NA    NA    NA       93 02-02-2018 Before   
10       2 A     C     F     Stop  NA    NA    NA    NA       88 05-02-2018 After    
11       2 A     C     F     G     Stop  NA    NA    NA       80 06-02-2018 NA
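Once the rows are flagged, the comparison the question is after can be set up by pairing each patient's "Before" and "After" scores. The sketch below is one possible way to do that in base R; the data frame `flagged` simply copies the four flagged rows from the example output (rows 5, 6, 9 and 10), and the column name `Change` is a hypothetical choice. It assumes at most one Before/After pair per patient.

```r
# The four flagged rows from the example (Patient, Score, Indicator only).
flagged <- data.frame(
  Patient = c(1, 1, 2, 2),
  Score = c(78, 98, 93, 88),
  Indicator = c("Before", "After", "Before", "After"),
  stringsAsFactors = FALSE
)

# Split by flag, then join on Patient so each patient gets one row.
before <- flagged[flagged$Indicator == "Before", c("Patient", "Score")]
after <- flagged[flagged$Indicator == "After", c("Patient", "Score")]
paired <- merge(before, after, by = "Patient", suffixes = c("_before", "_after"))

# Score change from the step before the target to the target step.
paired$Change <- paired$Score_after - paired$Score_before
```

With the output of this answer, where unflagged rows are `NA` rather than `0`, the same subsetting works after dropping the `NA` rows first, for example with `df[!is.na(df$Indicator), ]`.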