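I have a large data frame containing transactions. The fields are id (user ID), interval (an integer starting from 0), creation (transaction date), expiry (the date the subscription expires), and subscription (a character value, either "one year" or "two years"). I need to fill in the missing values in expiry according to several conditions that depend on the same row or on the previous row.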
df <- data.frame(id = id,
                 interval = interval,
                 creation = creation,
                 expiry = expiry,
                 subscription = subscription)
# sort by user ID, then by creation date
df <- df[order(df[, 1], df[, 3]), ]
# loop over all rows of the ordered df (by user ID and creation date)
for (i in 2:nrow(df)) {
  # only rows with a missing expiry need work
  if (is.na(df[i, 4])) {
    # if the previous row has the same ID and interval, treat this row as a
    # change to the existing subscription and reuse its expiry
    if (df[i-1, 1] == df[i, 1] && df[i-1, 2] == df[i, 2]) {
      df[i, 4] <- df[i-1, 4]
    # otherwise it is a new one- or two-year subscription, so add days to the creation date
    } else if (df[i, 5] == "one year") {
      df[i, 4] <- df[i, 3] + 365
    } else if (df[i, 5] == "two years") {
      df[i, 4] <- df[i, 3] + 730
    }
  }
}
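The code above does the job, but it leaves the first NA unfilled, and it is so slow that processing a data frame with millions of rows takes a very long time. How can I improve it and make it more R-like?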
Answer 0 (score: 0)
I think this might help:
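library(dplyr)

df <- data.frame(id = id,
                 interval = interval,
                 creation = creation,
                 expiry = expiry,
                 subscription = subscription)
df <- df[order(df[, 1], df[, 3]), ]

# TRUE when the previous row has the same id and interval
df$match_previous <- df[, 1] == lag(df[, 1]) & df[, 2] == lag(df[, 2])
df$match_previous[1] <- FALSE  # the first row has no previous row

# if_else() keeps the Date class; base ifelse() would return plain numbers for Date input
# note: lag() reads the original expiry, so a run of consecutive NA rows is not
# carried forward the way the loop does it
df[, 4] <- if_else(!is.na(df[, 4]),
                   df[, 4],
                   if_else(df$match_previous,
                           lag(df[, 4]),
                           if_else(df[, 5] == "one year",
                                   df[, 3] + 365, df[, 3] + 730)))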
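
One thing a lag-based version cannot reproduce is the loop's chaining: when several consecutive rows of the same id and interval are all missing, the loop reuses the value it has just filled in. Below is a rough base R sketch of one way to get that behaviour without a loop. It assumes that creation and expiry are Date columns and that id and interval contain no NA. The function name fill_expiry and the toy data are made up for illustration. Unlike the loop, it also fills the first row.

```r
# Toy data (hypothetical values, for illustration only)
df <- data.frame(
  id           = c(1, 1, 1, 2, 2),
  interval     = c(0, 0, 0, 0, 1),
  creation     = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01",
                           "2020-01-15", "2021-01-15")),
  expiry       = as.Date(c(NA, NA, NA, "2021-01-14", NA)),
  subscription = c("one year", "one year", "one year", "one year", "two years"),
  stringsAsFactors = FALSE
)

# fill_expiry is a hypothetical helper name, not part of any package
fill_expiry <- function(df) {
  df <- df[order(df$id, df$creation), ]
  n  <- nrow(df)
  if (n == 0) return(df)

  # TRUE where a row starts a new run of consecutive rows with equal id and interval
  new_run <- c(TRUE, df$id[-1] != df$id[-n] | df$interval[-1] != df$interval[-n])
  run_id  <- cumsum(new_run)

  # Step 1: a missing expiry at the start of a run is a new subscription,
  # so add the subscription length (in days) to the creation date
  days <- unname(c("one year" = 365, "two years" = 730)[as.character(df$subscription)])
  start_na <- new_run & is.na(df$expiry)
  df$expiry[start_na] <- df$creation[start_na] + days[start_na]

  # Step 2: carry the last known expiry forward, but only within the same run
  pos      <- seq_len(n)
  last_obs <- cummax(ifelse(is.na(df$expiry), 0L, pos))  # index of last non-NA so far
  carry    <- is.na(df$expiry) & last_obs > 0
  carry[carry] <- run_id[last_obs[carry]] == run_id[carry]
  df$expiry[carry] <- df$expiry[last_obs[carry]]

  df
}

out <- fill_expiry(df)
print(out)
```

On the toy data, the three rows for id 1 all end up with the expiry computed for the first of them, while the existing expiry for id 2 is left untouched. Rows whose subscription is neither "one year" nor "two years" stay NA, as in the loop.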