Suppose we have this data frame in R:
df <- data.frame(id = c(rep(1,5), rep(2, 3), rep(3, 4), rep(4, 2)), brand = c("A", "B", "A", "D", "Closed", "B", "C", "D", "D", "A", "B", "Closed", "C", "Closed"))
> df
# id brand
#1 1 A
#2 1 B
#3 1 A
#4 1 D
#5 1 Closed
#6 2 B
#7 2 C
#8 2 D
#9 3 D
#10 3 A
#11 3 B
#12 3 Closed
#13 4 C
#14 4 Closed
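I want to create a new variable that shows the change in the brand column from the current row to the next row, but only within each id (the last row of an id must not be paired with the first row of the next id).
I created the new column: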
df$brand_chg <- ""
This loop does exactly what I want:
for (i in 1:nrow(df)) {
j <- i + 1
if(j > nrow(df)) next #to prevent error in very last row
if (df[i,'id'] != df[j, 'id']) next #to skip loop when id changes
df[i,'brand_chg'] <- paste(df[i,'brand'], df[j,'brand'], sep = "->")
#populating concatenation
}
#Results:
# id brand brand_chg
#1 1 A A->B
#2 1 B B->A
#3 1 A A->D
#4 1 D D->Closed
#5 1 Closed
#6 2 B B->C
#7 2 C C->D
#8 2 D
#9 3 D D->A
#10 3 A A->B
#11 3 B B->Closed
#12 3 Closed
#13 4 C C->Closed
#14 4 Closed
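However, on a table with 287k rows, this loop takes at least 10 minutes to run. Does anyone know a faster way to do this concatenation?
Thanks, I appreciate your insight.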
Answer 0 (score: 5)
Using the dplyr package:
library(dplyr)
df %>% group_by(id) %>%
mutate(brand_chg = ifelse(seq_along(brand) == n(),
"",
paste(brand, lead(brand), sep = "->")))
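Answer 1 (score: 1)
Also dplyr, just slightly different, not better! It detects the last row of each id with is.na(lead(brand)) instead of comparing seq_along(brand) with n():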
library(dplyr)
df %>%
group_by(id) %>%
mutate(change = if_else(is.na(lead(brand)), "", paste0(brand,"->", lead(brand))))
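To see why both dplyr answers blank the last row of each id, it may help to look at lead() on its own. This is a minimal sketch on a hypothetical three-row group (the vector below is made up for illustration and stands for the brand values of a single id):

```r
library(dplyr)

# Hypothetical brand values for a single id
brand <- c("A", "B", "Closed")

# lead() shifts the vector up by one position and pads the end with NA
lead(brand)
# "B" "Closed" NA

# Test used in this answer: TRUE only where there is no next row
last_by_na <- is.na(lead(brand))

# Test used in the other dplyr answer; length(brand) plays the role of n()
last_by_pos <- seq_along(brand) == length(brand)
```

Both tests flag only the final element here. One difference worth keeping in mind: if brand itself contained a missing value, the is.na() test would also blank the row just before it, while the position-based test would not.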
Answer 2 (score: 1)
Here is an option using data.table:
library(data.table)
setDT(df)[, brand_chg := paste(brand, shift(brand, type = "lead"), sep="->"), id]
df[df[, .I[.N] , id]$V1, brand_chg := ""]
df
# id brand brand_chg
# 1: 1 A A->B
# 2: 1 B B->A
# 3: 1 A A->D
# 4: 1 D D->Closed
# 5: 1 Closed
# 6: 2 B B->C
# 7: 2 C C->D
# 8: 2 D
# 9: 3 D D->A
#10: 3 A A->B
#11: 3 B B->Closed
#12: 3 Closed
#13: 4 C C->Closed
#14: 4 Closed
Or a more compact option:
setDT(df)[, brand_chg := c(paste(brand[-.N], brand[-1], sep="->"), ""), id]
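For comparison, the same idea (pair every row with the row after it, then blank the pairs that cross an id boundary) can be sketched in base R without any package. This is not part of the answer above; it assumes the rows of each id are contiguous, as in the example data, and the helper names next_id, next_brand and same_id are made up for illustration:

```r
# Example data from the question
df <- data.frame(
  id = c(rep(1, 5), rep(2, 3), rep(3, 4), rep(4, 2)),
  brand = c("A", "B", "A", "D", "Closed", "B", "C", "D",
            "D", "A", "B", "Closed", "C", "Closed"),
  stringsAsFactors = FALSE
)

# Shift id and brand up by one row; the last row has no successor, so pad with NA
next_id    <- c(df$id[-1], NA)
next_brand <- c(df$brand[-1], NA)

# TRUE only when the next row exists and belongs to the same id
same_id <- !is.na(next_id) & df$id == next_id

# Vectorized concatenation; rows at the end of an id get an empty string
df$brand_chg <- ifelse(same_id, paste(df$brand, next_brand, sep = "->"), "")
```

Because everything is computed on whole vectors rather than row by row, this avoids the repeated single-cell assignments of the original loop. If the ids were not sorted into contiguous blocks, the data would need to be ordered by id first.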