I have ~ 4 million rows of personal data that looks like the following:
names <- c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Lisa", "Bert", "Carine", "Carine", "Carine", "Carine", "Carine", "Carine")
luckyToday <- c(0,0,0,NA,0,0,1,NA,1,NA,0,0,0,1,1)
luckyYesterday <- NA_real_
df1 <- data.frame(names,luckyToday,luckyYesterday)
df1
# names luckyToday luckyYesterday
# 1 Peter 0 NA
# 2 Peter 0 NA
# 3 Peter 0 NA
# 4 Peter NA NA
# 5 Peter 0 NA
# 6 Peter 0 NA
# 7 Peter 1 NA
# 8 Lisa NA NA
# 9 Bert 1 NA
# 10 Carine NA NA
# 11 Carine 0 NA
# 12 Carine 0 NA
# 13 Carine 0 NA
# 14 Carine 1 NA
# 15 Carine 1 NA
The data contains observations of people (some with one observation, some with more) and their luckiness (1 = lucky, 0 = unlucky, NA = no information). As a kind of lagged variable, I want to introduce a new variable ("luckyYesterday") that tells me whether the person was lucky at their previous observation. So I want the data to look like this:
df2
# names luckyToday luckyYesterday
# 1 Peter 0 NA
# 2 Peter 0 0
# 3 Peter 0 0
# 4 Peter NA 0
# 5 Peter 0 0
# 6 Peter 0 0
# 7 Peter 1 0
# 8 Lisa NA NA
# 9 Bert 1 NA
# 10 Carine NA NA
# 11 Carine 0 0
# 12 Carine 0 0
# 13 Carine 0 0
# 14 Carine 1 0
# 15 Carine 1 1
I know that R is not the perfect program for this kind of data wrangling, but I have to use it.
I want to consider the following things:
I tried it myself with two for-loops, but it takes ages on my data with over 4 million observations. Can anyone help me with a faster solution, for example with data.table or an apply function? I would appreciate that so much!
Cheers
Answer 0 (score: 2)
You can use the shift function from data.table to look at the previous observation, and the na.locf function from the zoo package to fill each NA with the previous or the next observation, depending on whether the fromLast parameter is FALSE or TRUE. Group by names so that observations of different people are not mixed:
library(data.table); library(zoo)
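# na.rm = FALSE keeps the result the same length as the input (by default na.locf drops NAs it cannot fill)
setDT(df1)[, luckyYesterday := shift(na.locf(luckyToday, fromLast = TRUE, na.rm = FALSE)), by = names]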
df1
# names luckyToday luckyYesterday
# 1: Peter 0 NA
# 2: Peter 0 0
# 3: Peter 0 0
# 4: Peter NA 0
# 5: Peter 0 0
# 6: Peter 0 0
# 7: Peter 1 0
# 8: Lisa NA NA
# 9: Bert 1 NA
# 10: Carine NA NA
# 11: Carine 0 0
# 12: Carine 0 0
# 13: Carine 0 0
# 14: Carine 1 0
# 15: Carine 1 1
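To make the two steps easier to follow, here is a small sketch on a single vector (Peter's observations from the question). The variable names x, filled and lagged are only for illustration, and na.rm = FALSE is my own addition to keep the vector length fixed:

```r
library(data.table)  # shift()
library(zoo)         # na.locf()

# Peter's observations from the question
x <- c(0, 0, 0, NA, 0, 0, 1)

# Step 1: replace each NA with the next non-NA value (fromLast = TRUE);
# na.rm = FALSE keeps the vector at its original length
filled <- na.locf(x, fromLast = TRUE, na.rm = FALSE)
filled
# [1] 0 0 0 0 0 0 1

# Step 2: lag by one position, so the first observation has no "yesterday"
lagged <- shift(filled)
lagged
# [1] NA  0  0  0  0  0  0
```

Inside the grouped call, the same two steps run once per person, which is why the first row of every person ends up as NA.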
Answer 1 (score: 2)
names <- c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter",
"Peter", "Lisa", "Bert", "Carine", "Carine", "Carine", "Carine", "Carine", "Carine")
luckyToday <- c(0,0,0,NA,0,0,1,NA,1,NA,0,0,0,1,1)
luckyYesterday <- NA
df1 <- data.frame(names,luckyToday,luckyYesterday)
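# New code
library(data.table)
# head(luckyToday, -1) is also safe for people with a single observation,
# where 1:(.N-1) would evaluate to c(1, 0); NAs in luckyToday are not filled here
data.table(df1)[, list(luckyToday, luckyYesterday = c(NA, head(luckyToday, -1))), by = list(names)]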
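If the NAs should also be skipped as in the desired output, one possible variant stays within data.table by combining shift with nafill. This is only a sketch: it assumes a recent data.table version that provides nafill, it relies on luckyToday being numeric, and the object name dt is hypothetical.

```r
library(data.table)

names <- c(rep("Peter", 7), "Lisa", "Bert", rep("Carine", 6))
luckyToday <- c(0, 0, 0, NA, 0, 0, 1, NA, 1, NA, 0, 0, 0, 1, 1)
dt <- data.table(names = names, luckyToday = luckyToday)

# type = "nocb" fills each NA with the next observation of the same person,
# shift() then lags the filled values by one row within each person
dt[, luckyYesterday := shift(nafill(luckyToday, type = "nocb")), by = names]
dt
```

On the example data this gives the same luckyYesterday column as shown in the question. Because := updates the table by reference, no copy of the 4 million rows is needed.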