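I have four columns in a data table df that I want to use to build a fifth column. The four current columns are year, month, id and conflict. Right now the conflict column contains only 1s and 0s, and within a given id, once a 1 occurs in a year, the remaining months of that year are also 1. I want to derive a new column, conflict_mutated, from conflict as follows: if a given year contains a 1 in any month and the previous year also contains a 1 in any month, every month of the current year should be 1 in conflict_mutated, and all existing 1s should be preserved.
So, if we have the following data: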
year month id conflict
1989 6 33 0
1989 7 33 0
1989 8 33 1
1989 9 33 1
1989 10 33 1
1989 11 33 1
1989 12 33 1
1990 1 33 0
1990 3 33 0
1990 3 33 0
1990 4 33 0
1990 5 33 1
1990 6 33 1
1990 7 33 1
1990 8 33 1
1990 9 33 1
1990 10 33 1
1990 11 33 1
1990 12 33 1
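I would then want the 0s in months 1, 2, 3 and 4 of 1990 set to 1 in conflict_mutated, because those rows share the same id and both 1989 (the previous year) and 1990 contain a 1. The sample data above would then look like this: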
year month id conflict conflict_mutated
1989 6 33 0 0
1989 7 33 0 0
1989 8 33 1 1
1989 9 33 1 1
1989 10 33 1 1
1989 11 33 1 1
1989 12 33 1 1
1990 1 33 0 1
1990 3 33 0 1
1990 3 33 0 1
1990 4 33 0 1
1990 5 33 1 1
1990 6 33 1 1
1990 7 33 1 1
1990 8 33 1 1
1990 9 33 1 1
1990 10 33 1 1
1990 11 33 1 1
1990 12 33 1 1
I have a solution, but it would take about 3 days to run. Here it is:
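library(dplyr)
conflict_mutated <- df$conflict
for (i in seq_len(nrow(df))) {
  # Rows of the same id in the previous year and in the current year
  prev_year <- filter(df, id == df$id[i], year == df$year[i] - 1)
  this_year <- filter(df, id == df$id[i], year == df$year[i])
  if (df$year[i] != 1989 && any(prev_year$conflict == 1) && any(this_year$conflict == 1)) {
    conflict_mutated[i] <- 1
  }
}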
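Is there a way to use group_by and mutate to make this faster or better? I am having trouble working out how to handle the year grouping, since the previous year has to be taken into account and shifted within the conditional logic, in combination with the id grouping.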
Answer 0: (score: 0)
library(readr)  # provides read_csv()
foo <- read_csv('df1.csv')
#print(foo, n =40)
## A tibble: 40 x 4
# year month id conflict
# <int> <int> <int> <int>
# 1 1989 6 33 0
# 2 1989 7 33 0
# 3 1989 8 33 1
# 4 1989 9 33 1
# 5 1989 10 33 1
# 6 1989 11 33 1
# 7 1989 12 33 1
# 8 1990 1 33 0
# 9 1990 3 33 0
#10 1990 3 33 0
#11 1990 4 33 0
#12 1990 5 33 1
#13 1990 6 33 1
#14 1990 7 33 1
#15 1990 8 33 1
#16 1990 9 33 1
#17 1990 10 33 1
#18 1990 11 33 1
#19 1990 12 33 1
#20 1991 1 33 0
#21 1989 6 34 0
#22 1989 7 34 0
#23 1989 8 34 1
#24 1989 9 34 1
#25 1989 10 34 1
#26 1989 11 34 1
#27 1989 12 34 1
#28 1990 1 34 0
#29 1990 3 34 0
#30 1990 3 34 0
#31 1990 4 34 0
#32 1990 5 34 1
#33 1990 6 34 1
#34 1990 7 34 1
#35 1990 8 34 1
#36 1990 9 34 1
#37 1990 10 34 1
#38 1990 11 34 1
#39 1990 12 34 1
#40 1991 1 34 0
library(dplyr)
# Yearly totals per id; lag() gives the previous year's total (assumes no gap years within an id).
bar <- foo %>% group_by(id, year) %>% summarize(yrtot = sum(conflict), .groups = "drop") %>%
  arrange(id, year) %>% group_by(id) %>% mutate(lastyrtot = lag(yrtot)) %>% ungroup()
# Fill a year with 1s when it and the previous year both contain a 1; otherwise keep conflict.
foo <- foo %>% left_join(bar, by = c("id", "year")) %>%
  mutate(conflict_mutate = ifelse(!is.na(lastyrtot) & yrtot > 0 & lastyrtot > 0, 1, conflict)) %>%
  select(-yrtot, -lastyrtot)
#R> print(foo, n=40)
## A tibble: 40 x 5
# year month id conflict conflict_mutate
# <int> <int> <int> <int> <dbl>
# 1 1989 6 33 0 0
# 2 1989 7 33 0 0
# 3 1989 8 33 1 1
# 4 1989 9 33 1 1
# 5 1989 10 33 1 1
# 6 1989 11 33 1 1
# 7 1989 12 33 1 1
# 8 1990 1 33 0 1
# 9 1990 3 33 0 1
#10 1990 3 33 0 1
#11 1990 4 33 0 1
#12 1990 5 33 1 1
#13 1990 6 33 1 1
#14 1990 7 33 1 1
#15 1990 8 33 1 1
#16 1990 9 33 1 1
#17 1990 10 33 1 1
#18 1990 11 33 1 1
#19 1990 12 33 1 1
#20 1991 1 33 0 0
#21 1989 6 34 0 0
#22 1989 7 34 0 0
#23 1989 8 34 1 1
#24 1989 9 34 1 1
#25 1989 10 34 1 1
#26 1989 11 34 1 1
#27 1989 12 34 1 1
#28 1990 1 34 0 1
#29 1990 3 34 0 1
#30 1990 3 34 0 1
#31 1990 4 34 0 1
#32 1990 5 34 1 1
#33 1990 6 34 1 1
#34 1990 7 34 1 1
#35 1990 8 34 1 1
#36 1990 9 34 1 1
#37 1990 10 34 1 1
#38 1990 11 34 1 1
#39 1990 12 34 1 1
#40 1991 1 34 0 0
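The same rule can also be written with a single grouped mutate, which avoids shifting yearly totals and so does not depend on every id having consecutive years. This is only a minimal sketch, assuming the column names from the question; add_conflict_mutated, year_has_1, prev_has_1 and the toy data are hypothetical names introduced here for illustration.

```r
library(dplyr)

# Hypothetical helper: within each id, fill a year with 1s when that year
# and the preceding calendar year both contain at least one 1.
add_conflict_mutated <- function(df) {
  df %>%
    group_by(id) %>%
    mutate(
      # TRUE for rows whose own year contains a 1 somewhere in this id
      year_has_1 = year %in% year[conflict == 1],
      # TRUE for rows whose preceding calendar year contains a 1 in this id
      prev_has_1 = (year - 1) %in% year[conflict == 1],
      # Otherwise keep the original value, so existing 1s are preserved
      conflict_mutated = ifelse(year_has_1 & prev_has_1, 1, conflict)
    ) %>%
    ungroup() %>%
    select(-year_has_1, -prev_has_1)
}

# Toy data (made up): 1990 follows a year with a 1, 1992 follows a gap year.
toy <- data.frame(
  year = c(1989, 1989, 1990, 1990, 1992, 1992),
  month = c(11, 12, 1, 2, 1, 2),
  id = 33,
  conflict = c(0, 1, 0, 1, 0, 1)
)
res <- add_conflict_mutated(toy)
res$conflict_mutated
# 0 1 1 1 0 1
```

Only 1990 is filled in, because 1992 has no rows for 1991 to compare against; whether a gap year should count as "no conflict" is an assumption of this sketch.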
Answer 1: (score: 0)
Here is a dirty solution (I ran into some problems with lapply and map); I am sure someone will pick it up and come up with something tidier.
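library(dplyr)
# dflist (built by the setup code below) is a list of ids, each holding one data frame per year.
for (j in seq_along(dflist)) {
  for (i in seq_along(dflist[[j]])) {
    cur <- dflist[[j]][[i]]
    if (i == 1) {
      # First year of this id: there is no previous year, so carry the running any() forward.
      dflist[[j]][[i]] <- mutate(cur, val = i, conflict_mutated = ifelse(cumany(conflict == 1), 1, 0))
    } else {
      # Later years: fill with 1s if this year and the previous one both contain a 1,
      # otherwise keep the original conflict values.
      prev <- dflist[[j]][[i - 1]]
      both <- any(cur$conflict == 1) && any(prev$conflict == 1)
      dflist[[j]][[i]] <- mutate(cur, val = i - 1, conflict_mutated = if (both) 1 else conflict)
    }
  }
}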
df <- read.table(text="
year month id conflict
1989 6 33 0
1989 7 33 0
1989 8 33 1
1989 9 33 1
1989 10 33 1
1989 11 33 1
1989 12 33 1
1990 1 33 0
1990 3 33 0
1990 3 33 0
1990 4 33 0
1990 5 33 1
1990 6 33 1
1990 7 33 1
1990 8 33 1
1990 9 33 1
1990 10 33 1
1990 11 33 1
1990 12 33 1
",header=T, stringsAsFactors = F)
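library(dplyr)
# Duplicate the sample under a second id so that there are two groups to split on.
df1 <- df
df1$id <- 44
df2 <- rbind(df, df1) %>% arrange(id, year, month)  # sorted so that split() and cumany() work correctly
# Nested list: one element per id, each holding one data frame per year.
dflist <- lapply(split(df2, df2$id), function(xtbl) split(xtbl, xtbl$year))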
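The loop in this answer leaves its result inside the nested list dflist rather than in a single data frame. A minimal sketch of flattening such a list back into one table with dplyr's bind_rows follows; toy, nested and flat are hypothetical stand-ins, not part of the original answer.

```r
library(dplyr)

# Hypothetical toy stand-in for the nested list: ids on the outside, years inside.
toy <- data.frame(
  year = c(1989, 1990, 1989, 1990),
  month = 1,
  id = c(33, 33, 44, 44),
  conflict = c(1, 0, 0, 1)
)
nested <- lapply(split(toy, toy$id), function(x) split(x, x$year))

# Bind the per-year frames within each id, then bind the per-id results.
flat <- bind_rows(lapply(nested, bind_rows))
```

Applied to the answer's list, the same call would be bind_rows(lapply(dflist, bind_rows)), run after the loop has added conflict_mutated to every element.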