Using R

Time: 2017-05-20 14:10:44

Tags: r missing-data

(image of the input table)

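I have the table above. I want to fill in the missing values under Transaction ID. The algorithm for filling them in is as follows:

  1. User ID "kenn1" has two missing transaction IDs, which can be filled in using that user's other two transaction IDs, t1 and t4.

  2. To choose between t1 and t4, I look at the event time. The first missing value occurs at 9:30, which is 30 minutes from t1 and 20 minutes from t4. Since t4 is closer to this missing value, it is filled in as t4. Similarly, the missing value in row 4 is 45 minutes from t1 and 5 minutes from t4, so it is also replaced with t4.

  3. The same approach applies to the missing values for user ID "kenn2".

How do I do this in R?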
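
The arithmetic in step 2 can be checked with a small base R sketch. The times for t1 (9:00) and t4 (9:50) are assumptions taken from the sample data in the answers, the calendar date is arbitrary, and the helper `to_time` is a made-up name.

```r
# Hypothetical helper: turn "HH:MM" into a POSIXct value on an arbitrary date
to_time <- function(x) {
  as.POSIXct(paste("2017-05-20", x), format = "%Y-%m-%d %H:%M", tz = "UTC")
}

# Known transactions for kenn1 (times assumed from the answers' sample data)
known <- data.frame(id = c("t1", "t4"),
                    time = to_time(c("09:00", "09:50")),
                    stringsAsFactors = FALSE)

# The first row with a missing TransactionID
missing_time <- to_time("09:30")

# Absolute distance in minutes from the missing row to each known transaction
dist_min <- abs(as.numeric(difftime(known$time, missing_time, units = "mins")))

# The known ID with the smallest distance is the one to fill in
nearest <- known$id[which.min(dist_min)]
print(data.frame(id = known$id, dist_min))
print(nearest)
```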

2 answers:

Answer 0: (score: 0)

There may be a better solution, but I wrote this one with data.table:

library(data.table)
library(zoo)  # na.locf() comes from zoo, not from data.table

# Create the data.table; real data could come from read.csv, read.xlsx, etc.
raw <- data.table(Event = paste0("e", 1:10),
                TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                UserId = c(rep("kenn1",4), rep("kenn2",6)),
                EventTime = as.POSIXct(
                  c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                    "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
                )

# Event times of the rows that already have a TransactionID
transactionTimes <- raw[!is.na(TransactionID), .(TransactionID, EventTime)]
# Nearest known ID above (carried forward) and below (carried backward), per user
raw[, Above := na.locf(TransactionID, na.rm = FALSE), by = UserId]
raw[, Below := na.locf(TransactionID, na.rm = FALSE, fromLast = TRUE), by = UserId]
# Attach the event times of those two candidate IDs
raw <- merge(raw, transactionTimes[, .(Above = TransactionID, AboveTime = EventTime)], by="Above", all.x = TRUE)
raw <- merge(raw, transactionTimes[, .(Below = TransactionID, BelowTime = EventTime)], by="Below", all.x = TRUE)
# Time distance from each row to its two candidates
raw[, AboveDiff := EventTime - AboveTime]
raw[, BelowDiff := BelowTime - EventTime]
# Only one candidate exists: use it
raw[is.na(TransactionID) & is.na(AboveDiff), TransactionID := Below]
raw[is.na(TransactionID) & is.na(BelowDiff), TransactionID := Above]
# Both candidates exist: take the closer one (ties go to Above)
raw[is.na(TransactionID), TransactionID := ifelse(AboveDiff <= BelowDiff, Above, Below)]
# Drop the helper columns; merge() re-sorts rows by its key, so restore user/time order
raw <- raw[order(UserId, EventTime), .(Event, TransactionID, UserId, EventTime)]
rm(transactionTimes)
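
The Above and Below columns rely on `na.locf()` from the zoo package ("last observation carried forward"). As a minimal illustration, here it is applied to kenn1's four IDs alone; the vector names are only for this sketch.

```r
library(zoo)  # provides na.locf()

# kenn1's TransactionID values in event order
ids <- c("t1", NA, NA, "t4")

# Carry the last known ID forward: the closest known ID at or before each row
above <- na.locf(ids, na.rm = FALSE)

# Carry the next known ID backward: the closest known ID at or after each row
below <- na.locf(ids, na.rm = FALSE, fromLast = TRUE)

print(data.frame(ids, above, below))
```

For these four rows, above comes out as t1, t1, t1, t4 and below as t1, t4, t4, t4, so every missing row ends up with exactly two candidates to compare by time. `na.rm = FALSE` keeps leading or trailing NAs in place instead of dropping them, which preserves the vector length inside each group.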

Answer 1: (score: 0)

Another solution with data.table.

library(data.table)
# Create the data.table; real data could come from read.csv, read.xlsx, etc.
raw <- data.table(Event = paste0("e", 1:10),
                  TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                  UserId = c(rep("kenn1",4), rep("kenn2",6)),
                  EventTime = as.POSIXct(
                    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                      "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
)

# Subset the rows that already have a TransactionID (the candidate rows)
raw_notNA <- raw[!is.na(TransactionID)]
# Merge the subset with the original by user (each original row is repeated once per candidate row)
merged <- merge(raw, raw_notNA, all.x = TRUE, by = "UserId", allow.cartesian = TRUE)
# Calculate the time difference between the original and candidate rows
merged[, DiffTime := abs(EventTime.x - EventTime.y)]
# Take the new TransactionID from the closest candidate; [1L] keeps a single value if two candidates tie
merged[, NewTransactionID := TransactionID.y[DiffTime == min(DiffTime)][1L], by = Event.x]
# Remove the duplicated rows and drop the unnecessary columns
output <- merged[, .SD[1], by = Event.x][, list(Event.x, NewTransactionID, UserId, EventTime.x)]

names(output) <- names(raw)
print(output)
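
When picking the closest candidate, ties are worth a thought: comparing against `min()` keeps every tied element, while taking the first element or using `which.min()` keeps only one. A tiny sketch with made-up distances:

```r
# Hypothetical distances in minutes, with two candidates tied for the minimum
diff_min <- c(2, 1, 1)
ids <- c("t5", "t6", "t8")

# Comparing with min() selects every tied candidate
all_tied <- ids[diff_min == min(diff_min)]

# which.min() returns the position of the first minimum only
first_tied <- ids[which.min(diff_min)]

print(all_tied)
print(first_tied)
```

The sample data has no ties, so this only matters for other inputs, where a single value per event is needed for the assignment.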

Inspired by the answers to this question (your question is not a duplicate, just similar):

R - merge dataframes on matching A, B and *closest* C?
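
The "closest match" idea can also be sketched as a data.table rolling join with `roll = "nearest"`, which avoids building every row/candidate pair. This is an alternative sketch, not the method used in either answer; `known`, `KnownID`, `nearest` and `FilledID` are illustrative names, and `tz = "UTC"` is an added assumption.

```r
library(data.table)

# Sample data copied from the answers; tz = "UTC" is an added assumption
raw <- data.table(
  Event = paste0("e", 1:10),
  TransactionID = c("t1", NA, NA, "t4", NA, "t5", "t6", NA, NA, "t8"),
  UserId = c(rep("kenn1", 4), rep("kenn2", 6)),
  EventTime = as.POSIXct(
    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45",
      "2017-05-20 9:50", "2017-05-20 10:01", "2017-05-20 10:02",
      "2017-05-20 10:03", "2017-05-20 10:04", "2017-05-20 10:05",
      "2017-05-20 10:06"),
    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Lookup table of the rows whose TransactionID is known
known <- raw[!is.na(TransactionID),
             .(UserId, EventTime, KnownID = TransactionID)]

# For each row of raw, find the row of known with the same UserId and the
# nearest EventTime; the result has one row per row of raw, in raw's order
nearest <- known[raw, on = .(UserId, EventTime), roll = "nearest"]

# Keep existing IDs and fill only the missing ones
raw[, FilledID := ifelse(is.na(TransactionID), nearest$KnownID, TransactionID)]
print(raw)
```

The join matches UserId exactly and rolls only on the last `on` column, EventTime, so a missing value is never filled from another user's transaction.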