Concatenate the current row and the following row in a new column

Time: 2017-05-05 21:07:12

Tags: r loops dataframe concatenation

Suppose we have this data frame in R:

df <- data.frame(id = c(rep(1,5), rep(2, 3), rep(3, 4), rep(4, 2)), brand = c("A", "B", "A", "D", "Closed", "B", "C", "D", "D", "A", "B", "Closed", "C", "Closed"))

> df
#   id  brand
#1   1      A
#2   1      B
#3   1      A
#4   1      D
#5   1 Closed
#6   2      B
#7   2      C
#8   2      D
#9   3      D
#10  3      A
#11  3      B
#12  3 Closed
#13  4      C
#14  4 Closed

I want to create a new variable that shows the change in the brand column from the current row to the next row, but only within each id.

I created the new column:

df$brand_chg <- ""

This loop correctly does what I want:

for (i in 1:nrow(df)) {

    j <- i + 1

    if(j > nrow(df)) next #to prevent error in very last row

    if (df[i,'id'] != df[j, 'id']) next #to skip loop when id changes

    df[i,'brand_chg'] <- paste(df[i,'brand'], df[j,'brand'], sep = "->") 
    #populating concatenation
}

#Results:
#   id  brand brand_chg
#1   1      A      A->B
#2   1      B      B->A
#3   1      A      A->D
#4   1      D D->Closed
#5   1 Closed          
#6   2      B      B->C
#7   2      C      C->D
#8   2      D          
#9   3      D      D->A
#10  3      A      A->B
#11  3      B B->Closed
#12  3 Closed          
#13  4      C C->Closed
#14  4 Closed 

However, on a table with 287k rows, this loop takes at least 10 minutes to run. Does anyone know a faster way to do this concatenation?

Thanks, I appreciate your insight.

3 Answers:

Answer 0: (score: 5)

Using the dplyr package:

library(dplyr)

df %>% group_by(id) %>% 
    mutate(brand_chg = ifelse(seq_along(brand) == n(), 
                              "", 
                              paste(brand, lead(brand), sep = "->")))
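
For anyone who wants to run this end to end, here is a minimal sketch on the sample data from the question. The `stringsAsFactors = FALSE` argument, the `res` name, and the trailing `ungroup()` are my additions, not part of the answer above. The idea is that `lead()` shifts `brand` up by one position within each id, and the last row of each group (position equal to `n()`) gets an empty string instead of a pair.

```r
library(dplyr)

# Sample data from the question; stringsAsFactors = FALSE (an assumption here)
# keeps brand as a character column on older R versions as well.
df <- data.frame(
  id = c(rep(1, 5), rep(2, 3), rep(3, 4), rep(4, 2)),
  brand = c("A", "B", "A", "D", "Closed", "B", "C", "D",
            "D", "A", "B", "Closed", "C", "Closed"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(id) %>%
  # Last row of each id gets "", every other row gets "current->next"
  mutate(brand_chg = ifelse(seq_along(brand) == n(),
                            "",
                            paste(brand, lead(brand), sep = "->"))) %>%
  ungroup()  # drop the grouping so later steps are not grouped by accident

res
```

Note that the pipeline returns a new tibble, so the result has to be assigned (here to `res`) if you want to keep it.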

Answer 1: (score: 1)

Also dplyr, just slightly different and no better! It uses is.na on the lead value instead of comparing the row position with n():

library(dplyr)
df %>% 
  group_by(id) %>%
  mutate(change = if_else(is.na(lead(brand)), "", paste0(brand,"->", lead(brand))))
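
One difference between the two dplyr variants may matter if `brand` can contain missing values. The sketch below uses a made-up three-element vector (not data from the question) and treats it as a single group, so `length(brand)` stands in for `n()`. The `is.na` test blanks any row whose next value is `NA`, while the position test only blanks the last row.

```r
library(dplyr)

# Hypothetical single group with a missing brand in the middle
brand <- c("A", NA, "B")

# Variant from this answer: blank whenever the next value is NA
by_na <- if_else(is.na(lead(brand)), "", paste0(brand, "->", lead(brand)))

# Position-based variant: blank only on the last element
by_pos <- ifelse(seq_along(brand) == length(brand),
                 "",
                 paste(brand, lead(brand), sep = "->"))

by_na
by_pos
```

With no missing brands, as in the question's data, both variants give the same column.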

Answer 2: (score: 1)

Here is an option using data.table:

library(data.table)
setDT(df)[, brand_chg := paste(brand, shift(brand, type = "lead"), sep="->"), id]
df[df[, .I[.N] , id]$V1, brand_chg := ""]
df
#    id  brand brand_chg
# 1:  1      A      A->B
# 2:  1      B      B->A
# 3:  1      A      A->D
# 4:  1      D D->Closed
# 5:  1 Closed          
# 6:  2      B      B->C
# 7:  2      C      C->D
# 8:  2      D          
# 9:  3      D      D->A
#10:  3      A      A->B
#11:  3      B B->Closed
#12:  3 Closed          
#13:  4      C C->Closed
#14:  4 Closed          
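
The second line of that option can look cryptic, so here is a small sketch of what `.I[.N]` selects. The table `dt` and the name `last_rows` are made up for illustration. Within each group, `.I` holds the row numbers in the whole table and `.N` is the group size, so `.I[.N]` is the table-wide row number of the group's last row.

```r
library(data.table)

# Small illustrative table, not the data from the question
dt <- data.table(id = c(1, 1, 1, 2, 2),
                 brand = c("A", "B", "Closed", "C", "Closed"))

# Row number (in the whole table) of the last row of each id
last_rows <- dt[, .I[.N], by = id]$V1
last_rows

# Same two steps as above: build every pair, then blank the last row per id
dt[, brand_chg := paste(brand, shift(brand, type = "lead"), sep = "->"), by = id]
dt[last_rows, brand_chg := ""]
dt
```

Keep in mind that `setDT(df)` converts `df` to a data.table by reference, so the original object itself is changed.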

Or a more compact option:

setDT(df)[, brand_chg := c(paste(brand[-.N], brand[-1], sep="->"), ""), id]
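
To see why the compact version lines up, here is the same indexing in plain base R on the brands of id 1, with `length(brand)` standing in for `.N`. The variable names are only for illustration. Dropping the last element gives the "current" values, dropping the first gives the "next" values, and the trailing `""` fills the last row.

```r
# Brands of id 1 from the question
brand <- c("A", "B", "A", "D", "Closed")
n <- length(brand)

current   <- brand[-n]  # every element except the last
following <- brand[-1]  # every element except the first

chg <- c(paste(current, following, sep = "->"), "")
chg

# A group with a single row: both slices are empty, so only "" remains
single <- "X"
chg_single <- c(paste(single[-length(single)], single[-1], sep = "->"), "")
chg_single
```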