Question

I have ~ 4 million rows of personal data that looks like the following:

names <- c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Lisa", "Bert", "Carine", "Carine", "Carine", "Carine", "Carine", "Carine")
luckyToday <- c(0,0,0,NA,0,0,1,NA,1,NA,0,0,0,1,1)
luckyYesterday <- NA_real_
df1 <- data.frame(names,luckyToday,luckyYesterday)

df1
# names luckyToday luckyYesterday
# 1   Peter          0             NA
# 2   Peter          0             NA
# 3   Peter          0             NA
# 4   Peter         NA             NA
# 5   Peter          0             NA
# 6   Peter          0             NA
# 7   Peter          1             NA
# 8    Lisa         NA             NA
# 9    Bert          1             NA
# 10 Carine         NA             NA
# 11 Carine          0             NA
# 12 Carine          0             NA
# 13 Carine          0             NA
# 14 Carine          1             NA
# 15 Carine          1             NA

The data contains observations of people (some with 1 observation, some with more) and their luckiness (1=lucky, 0=unlucky, NA=no information). As kind of a lagged variable, I want to introduce a new variable ("luckyYesterday") that tells me if the person was lucky during the last observation or not. So I want the data look like this:

df2
# names luckyToday luckyYesterday
# 1   Peter          0             NA
# 2   Peter          0              0
# 3   Peter          0              0
# 4   Peter         NA              0
# 5   Peter          0              0
# 6   Peter          0              0
# 7   Peter          1              0
# 8    Lisa         NA             NA
# 9    Bert          1             NA
# 10 Carine         NA             NA
# 11 Carine          0              0
# 12 Carine          0              0
# 13 Carine          0              0
# 14 Carine          1              0
# 15 Carine          1              1

I know that R is not the perfect programm to apply such data wrangling, but it is necessary.

I want to consider the following things:

Consider that these are unique persons I have observations from, dont mix up observations of different people.
Lower observations are newer. The last observation of a unique person is the latest one (row #7 is the latest observation of Peter).
If there is a NA in the "luckyToday" column (for example row #4), I dont want the corresponding "luckyYesterdayNA" (row #5) filled with a NA, too. I rather want the corresponding cell get filled with the value of the newest "non-NA" observation in the "luckyToday" column, which is a "0" in the example data. If the previous observations are only NAs, then I assume that they equal the latest observation. (if I have 9 NAs and the 10th observation is a "1", then I want to assume that observations 1-9 are also "1".
If there is only one observation per person, I want a NA in the "luckyYesterday" column anyway since I have no knowledge about the previous state of luck.

I tried it by myself with 2 for-loops, but I takes ages on my data with over 4 million observations. Can anyone help my with a faster solution such as with data.table or an apply function, please? I would appreciate that so much!

Cheers

Answer 1

You can use the shift function from data.table to observe yesterday and na.locf function from zoo package to fill NA with yesterday or tomorrow depending on if the fromLast parameter is F or T, and also group by the name if you don't want to mix observations of different people:

library(data.table); library(zoo)
setDT(df1)[,luckyYesterday := shift(na.locf(luckyToday, fromLast = TRUE)), names]

df1
#      names luckyToday luckyYesterday
#  1:  Peter          0             NA
#  2:  Peter          0              0
#  3:  Peter          0              0
#  4:  Peter         NA              0
#  5:  Peter          0              0
#  6:  Peter          0              0
#  7:  Peter          1              0
#  8:   Lisa         NA             NA
#  9:   Bert          1             NA
# 10: Carine         NA             NA
# 11: Carine          0              0
# 12: Carine          0              0
# 13: Carine          0              0
# 14: Carine          1              0
# 15: Carine          1              1

Answer 2

names <- c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter", 
    "Peter", "Lisa", "Bert", "Carine", "Carine", "Carine", "Carine", "Carine", "Carine")
luckyToday <- c(0,0,0,NA,0,0,1,NA,1,NA,0,0,0,1,1)
luckyYesterday <- NA
df1 <- data.frame(names,luckyToday,luckyYesterday)

# New code
library(data.table) 
data.table(df1)[,list(luckyToday, c(NA, luckyToday[1:(.N-1)])),by=list(names)]

Conditionally fill column by other columns values

2 个答案: