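R - Change column values based on conditions involving other column groups

Time: 2018-07-22 19:58:13

Tags: r group-by dplyr mutate

I have a data table df with four columns, and I want to build a fifth column from them. The four columns are year, month, id, and conflict. The conflict column contains only 1s and 0s, and within a given id, once a 1 appears in a year, the remaining months of that year are also 1. I want to derive a new column, conflict_mutated, from conflict as follows: if a given year contains a 1 in any month and the previous year also contains a 1 in any month, then conflict_mutated should be 1 for every month of the current year, while all existing 1s are kept.

So, if we have the following data: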

year month id conflict
1989 6     33 0
1989 7     33 0
1989 8     33 1
1989 9     33 1
1989 10    33 1
1989 11    33 1
1989 12    33 1
1990 1     33 0
1990 3     33 0
1990 3     33 0
1990 4     33 0
1990 5     33 1
1990 6     33 1
1990 7     33 1
1990 8     33 1
1990 9     33 1
1990 10    33 1
1990 11    33 1
1990 12    33 1

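Here I want the 0s in months 1, 2, 3 and 4 of 1990 in conflict to become 1, because those rows share the same id and both 1989 (the previous year) and 1990 contain a 1. The sample data above would then look like this: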

year month id conflict conflict_mutated
1989 6     33 0        0
1989 7     33 0        0
1989 8     33 1        1
1989 9     33 1        1
1989 10    33 1        1
1989 11    33 1        1
1989 12    33 1        1
1990 1     33 0        1
1990 3     33 0        1
1990 3     33 0        1
1990 4     33 0        1
1990 5     33 1        1
1990 6     33 1        1
1990 7     33 1        1
1990 8     33 1        1
1990 9     33 1        1
1990 10    33 1        1
1990 11    33 1        1
1990 12    33 1        1

I have a solution, but it takes about 3 days to finish. Here it is:

library(dplyr)  # filter()

conflict_mutated <- df$conflict

# Row by row: df is filtered twice for every row, which is why this is so slow
for (i in seq_len(nrow(df))) {
  if (df$year[i] != 1989 &&
      any(filter(df, id == df$id[i], year == df$year[i] - 1)$conflict == 1) &&
      any(filter(df, id == df$id[i], year == df$year[i])$conflict == 1)) {
    conflict_mutated[i] <- 1
  }
}

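Is there a way to use group_by and mutate to make this faster or cleaner? I am having trouble working out how to account for the grouping by year: the year has to be taken into account and shifted inside conditional logic that is also combined with the various ids.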

2 Answers:

Answer 0: (score: 0)

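library(readr)     # read_csv()
library(dplyr)     # group_by(), summarize(), mutate(), left_join()
library(magrittr)  # %<>%

foo  <- read_csv('df1.csv')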
#print(foo, n =40)
## A tibble: 40 x 4
#    year month    id conflict
#   <int> <int> <int>    <int>
# 1  1989     6    33        0
# 2  1989     7    33        0
# 3  1989     8    33        1
# 4  1989     9    33        1
# 5  1989    10    33        1
# 6  1989    11    33        1
# 7  1989    12    33        1
# 8  1990     1    33        0
# 9  1990     3    33        0
#10  1990     3    33        0
#11  1990     4    33        0
#12  1990     5    33        1
#13  1990     6    33        1
#14  1990     7    33        1
#15  1990     8    33        1
#16  1990     9    33        1
#17  1990    10    33        1
#18  1990    11    33        1
#19  1990    12    33        1
#20  1991     1    33        0
#21  1989     6    34        0
#22  1989     7    34        0
#23  1989     8    34        1
#24  1989     9    34        1
#25  1989    10    34        1
#26  1989    11    34        1
#27  1989    12    34        1
#28  1990     1    34        0
#29  1990     3    34        0
#30  1990     3    34        0
#31  1990     4    34        0
#32  1990     5    34        1
#33  1990     6    34        1
#34  1990     7    34        1
#35  1990     8    34        1
#36  1990     9    34        1
#37  1990    10    34        1
#38  1990    11    34        1
#39  1990    12    34        1
#40  1991     1    34        0
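# Number of conflict months per id and year
bar  <-  foo %>% group_by(id, year) %>% dplyr::summarize(yrtot = sum(conflict))
library(data.table)  # shift()
# Previous row's total within each id (NA for the first year of an id);
# this treats the previous row as the previous year, so it assumes no year is missing
bar  %<>% ungroup() %>% group_by(id)  %>%  dplyr::mutate(lastyrtot=shift(yrtot, n=1)) %>% ungroup()
# 1 when this year and the previous year both contain a conflict month; otherwise keep the old value
foo  %<>%  left_join(bar, by = c("id", "year"))  %>% 
        dplyr::mutate(conflict_mutate = ifelse(yrtot > 0 & lastyrtot > 0, 1, conflict) )
# The first year of each id has no previous year: fall back to the original column
foo %<>% dplyr::mutate(conflict_mutate  =  ifelse(is.na(lastyrtot), conflict, conflict_mutate)) %>% select(-yrtot, -lastyrtot) 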

#R> print(foo, n=40)
## A tibble: 40 x 5
#    year month    id conflict conflict_mutate
#   <int> <int> <int>    <int>           <dbl>
# 1  1989     6    33        0               0
# 2  1989     7    33        0               0
# 3  1989     8    33        1               1
# 4  1989     9    33        1               1
# 5  1989    10    33        1               1
# 6  1989    11    33        1               1
# 7  1989    12    33        1               1
# 8  1990     1    33        0               1
# 9  1990     3    33        0               1
#10  1990     3    33        0               1
#11  1990     4    33        0               1
#12  1990     5    33        1               1
#13  1990     6    33        1               1
#14  1990     7    33        1               1
#15  1990     8    33        1               1
#16  1990     9    33        1               1
#17  1990    10    33        1               1
#18  1990    11    33        1               1
#19  1990    12    33        1               1
#20  1991     1    33        0               0
#21  1989     6    34        0               0
#22  1989     7    34        0               0
#23  1989     8    34        1               1
#24  1989     9    34        1               1
#25  1989    10    34        1               1
#26  1989    11    34        1               1
#27  1989    12    34        1               1
#28  1990     1    34        0               1
#29  1990     3    34        0               1
#30  1990     3    34        0               1
#31  1990     4    34        0               1
#32  1990     5    34        1               1
#33  1990     6    34        1               1
#34  1990     7    34        1               1
#35  1990     8    34        1               1
#36  1990     9    34        1               1
#37  1990    10    34        1               1
#38  1990    11    34        1               1
#39  1990    12    34        1               1
#40  1991     1    34        0               0
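
The same idea can also be sketched with dplyr alone. This is a sketch under stated assumptions rather than the answerer's code: it assumes dplyr 1.0.0 or later (for the .groups argument), builds a small hypothetical sample inline instead of reading df1.csv, and looks up the previous calendar year by joining on year + 1 rather than by taking the previous row. The names sample_df, year_flags, prev_flags and result are made up for illustration.

```r
library(dplyr)

# Hypothetical sample: id 33 has conflict in 1989 and 1990;
# id 34 has conflict in 1990 and 1992 but no rows at all for 1991
sample_df <- data.frame(
  year     = c(rep(1989L, 7), rep(1990L, 12), 1990L, 1990L, 1992L, 1992L),
  month    = c(6:12, 1:12, 11:12, 1:2),
  id       = c(rep(33L, 19), rep(34L, 4)),
  conflict = c(0L, 0L, rep(1L, 5), rep(0L, 4), rep(1L, 8), 0L, 1L, 0L, 1L)
)

# One row per id/year: does that year contain any conflict month?
year_flags <- sample_df %>%
  group_by(id, year) %>%
  summarise(any_conflict = any(conflict == 1), .groups = "drop")

# Relabel each flag with the following year so it joins as "previous year"
prev_flags <- year_flags %>%
  transmute(id, year = year + 1L, prev_conflict = any_conflict)

result <- sample_df %>%
  left_join(year_flags, by = c("id", "year")) %>%
  left_join(prev_flags, by = c("id", "year")) %>%
  mutate(
    # No rows for the previous year means no known conflict in it
    prev_conflict    = coalesce(prev_conflict, FALSE),
    # All 1 when both years have conflict; otherwise keep the old value
    conflict_mutated = if_else(any_conflict & prev_conflict, 1L, conflict)
  ) %>%
  select(-any_conflict, -prev_conflict)
```

With this sample, every 1990 row of id 33 becomes 1, while the 1992 rows of id 34 keep their original values because 1991 is absent. Whether a missing year should count as "no conflict" is a design choice; the join on year + 1 just makes that choice explicit.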

Answer 1: (score: 0)

This is a dirty solution (I ran into some problems with lapply/map); I am sure someone will come along and propose something tidier.

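for(j in seq_along(dflist)){
  for (i in seq_along(dflist[[j]])){
    #print(paste(j,i))
    if(i==1){
      # first year of this id: there is no previous year, so 1 from the first conflict month onwards
      dflist[[j]][[i]] <- mutate(dflist[[j]][[i]], val = i, conflict_mutated = ifelse(cumany(conflict == 1),1,0)) #make sure column 
    } else{
      # later years: all 1 if this year and the previous list element both contain a 1,
      # otherwise keep the old values (assumes consecutive years within each id)
      dflist[[j]][[i]] <-  mutate(dflist[[j]][[i]], val = i-1, 
                           conflict_mutated = if (any(conflict == 1) && any(dflist[[j]][[i-1]][['conflict']] == 1)) 1 else conflict )
    }

  }
  }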

Data

library(dplyr)  # %>%, arrange(), mutate(), cumany()

df <- read.table(text="
     year month id conflict
             1989 6     33 0
             1989 7     33 0
             1989 8     33 1
             1989 9     33 1
             1989 10    33 1
             1989 11    33 1
             1989 12    33 1
             1990 1     33 0
             1990 3     33 0
             1990 3     33 0
             1990 4     33 0
             1990 5     33 1
             1990 6     33 1
             1990 7     33 1
             1990 8     33 1
             1990 9     33 1
             1990 10    33 1
             1990 11    33 1
             1990 12    33 1
              ",header=TRUE, stringsAsFactors = FALSE)

df1 <- df
df1$id<- 44
df2 <- rbind(df,df1) %>% arrange(id, year, month) #For split and cumany to work correctly
dflist <- lapply( (df2 %>% split(., .[,'id'])), function(xtbl) xtbl %>% split(., .[,'year']))
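
The split-by-id, then split-by-year idea in this answer can also be written with split() and lapply(), which makes it easy to bind the pieces back into one data frame. This is a minimal sketch on a tiny hypothetical sample, not the answerer's code; mutate_one_id, small and result are made-up names, and the previous year is looked up by value so a gap in the years is not mistaken for "last year".

```r
library(dplyr)

# Hypothetical helper: apply the rule to all rows of a single id
mutate_one_id <- function(one_id) {
  by_year <- split(one_id, one_id$year)   # per-year pieces, ascending by year
  years   <- as.integer(names(by_year))
  has_conflict <- vapply(by_year, function(d) any(d$conflict == 1), logical(1))

  pieces <- lapply(seq_along(by_year), function(i) {
    prev <- match(years[i] - 1L, years)   # position of the previous year, NA if absent
    both <- has_conflict[[i]] && !is.na(prev) && has_conflict[[prev]]
    # `both` is a single TRUE/FALSE, so plain if/else is used instead of ifelse()
    mutate(by_year[[i]], conflict_mutated = if (both) 1L else conflict)
  })
  bind_rows(pieces)
}

# Tiny hypothetical sample: conflict in Dec 1989 and in Feb 1990
small <- data.frame(
  year     = c(1989L, 1989L, 1990L, 1990L),
  month    = c(11L, 12L, 1L, 2L),
  id       = 33L,
  conflict = c(0L, 1L, 0L, 1L)
)

result <- bind_rows(lapply(split(small, small$id), mutate_one_id))
```

Here the January 1990 row is the only one that changes, from 0 to 1. Note that ifelse() returns a result shaped like its test, so with a length-one test it would return only the first element of conflict; that is why the sketch uses if/else.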