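I have several hundred thousand rows, most of which are missing a value (column 2). Based on the primary key (column 1), I can assume that a missing value can be imputed from the value associated with the same key. An example is in order.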
Primary Key Date Date.Impute
123 ""
123 ""
123 02/02/2017
1234 ""
1234 02/03/2017
1234 ""
12345 01/01/2017
12345 ""
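All orders with key "123" have the date "02/02/2017". All orders with key "1234" have the date "02/03/2017", and so on.
In R, with or without an index-match style function, how do I fill column 3 with the value from column 2 for every row, including the rows where column 2 is missing? The final result should look like this: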
Primary Key Date Date.Impute
123 "" 02/02/2017
123 "" 02/02/2017
123 02/02/2017 02/02/2017
1234 "" 02/03/2017
1234 02/03/2017 02/03/2017
1234 "" 02/03/2017
12345 01/01/2017 01/01/2017
12345 "" 01/01/2017
I know how to do this in Excel and am happy to share it, but I want to understand how to do it in R. Any help would be much appreciated. Thanks.
Answer 0 (score: 6)
merge(df, unique(df[df$Date != "", ]), by = "Primary.Key", all.x = TRUE)
# Primary.Key Date.x Date.y
#1 123 02/02/2017
#2 123 02/02/2017
#3 123 02/02/2017 02/02/2017
#4 1234 02/03/2017
#5 1234 02/03/2017 02/03/2017
#6 1234 02/03/2017
#7 12345 01/01/2017 01/01/2017
#8 12345 01/01/2017
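Note that if a key ever carries two different non-empty dates, `unique()` keeps both and `merge` will duplicate that key's rows. A lookup with base R's `match`, which is probably the closest analogue to Excel's INDEX/MATCH, avoids this by taking the first match only. This is a minimal sketch, not part of the original answer; it assumes a two-column `df` rebuilt from the question's example and that empty strings mark missing dates. The name `known` is made up for illustration.

```r
# Sample data reconstructed from the question (an assumption about the real layout)
df <- data.frame(
  Primary.Key = c(123, 123, 123, 1234, 1234, 1234, 12345, 12345),
  Date = c("", "", "02/02/2017", "", "02/03/2017", "", "01/01/2017", ""),
  stringsAsFactors = FALSE
)

# Lookup table (hypothetical name): only the rows that actually carry a date
known <- df[df$Date != "", ]

# match() gives, for each key in df, the position of the first matching key in known;
# keys without any date produce NA, so their imputed value is NA as well
df$Date.Impute <- known$Date[match(df$Primary.Key, known$Primary.Key)]
df
```

This keeps the original row order and row count, which `merge` does not guarantee.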
Answer 1 (score: 2)
I added an extra row with Primary.Key == 123456 that has no Date value.
library(lubridate)
# Dates are parsed from month/day/year strings; NA marks a missing date
df <- data.frame(Primary.Key = c(123,123,123,1234,1234,1234,12345,12345,123456),
Date=mdy(c(NA,NA,"02/02/2017",NA,"02/03/2017",NA,"01/01/2017",NA,NA)),
Date.Impute=as.Date(rep(NA,9)), stringsAsFactors=FALSE)
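I use ifelse to handle entries such as Primary.Key == 123456 that have no Date values at all. I also switched from unique to tail(sort(), 1), which picks the latest date in each group.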
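library(dplyr)
library(purrr)
L <- split(df, df$Primary.Key) # split by Primary.Key groups into a list of data frames
# For each group, take the latest non-missing Date (sort() drops NAs); NA if the group has none
df1 <- map_df(L, ~.x %>% mutate(Date.Impute = ifelse(length(tail(sort(Date),1))==0, as.character(NA), as.character(tail(sort(Date),1)))))
# ifelse() returned character, so convert the column back to Date
df2 <- df1 %>% mutate(Date.Impute = ymd(Date.Impute))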
Primary.Key Date Date.Impute
1 123 <NA> 2017-02-02
2 123 <NA> 2017-02-02
3 123 2017-02-02 2017-02-02
4 1234 <NA> 2017-02-03
5 1234 2017-02-03 2017-02-03
6 1234 <NA> 2017-02-03
7 12345 2017-01-01 2017-01-01
8 12345 <NA> 2017-01-01
9 123456 <NA> <NA>
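The split/map step can probably be expressed with `group_by` instead. The following is a sketch under the same assumptions as above (Date stored as a Date column with NA for missing values, latest date wins), not the answer author's code. The sample data is rebuilt here with `as.Date` so the block runs without lubridate.

```r
library(dplyr)

# Sample data rebuilt from this answer, including the key with no date at all
df <- data.frame(
  Primary.Key = c(123, 123, 123, 1234, 1234, 1234, 12345, 12345, 123456),
  Date = as.Date(c(NA, NA, "2017-02-02", NA, "2017-02-03", NA, "2017-01-01", NA, NA))
)

df2 <- df %>%
  group_by(Primary.Key) %>%
  # Latest non-missing date per key; keys without any date stay NA
  mutate(Date.Impute = if (all(is.na(Date))) as.Date(NA) else max(Date, na.rm = TRUE)) %>%
  ungroup()

df2
```

The explicit `all(is.na(Date))` check plays the role of the `ifelse` guard: calling `max` with `na.rm = TRUE` on a group with no dates would otherwise warn and return an infinite value.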
Answer 2 (score: 1)
This will probably be a bit slow, but it at least gets the job done:
for (key in unique(df$Primary_Key)) {
keyrows <- df$Primary_Key == key
key_d <- df[keyrows & df$Date != "", "Date"][1]
df[keyrows, "Date.impute"] <- key_d
}
df
Primary_Key Date Date.impute
1 123 02/02/2017
2 123 02/02/2017
3 123 02/02/2017 02/02/2017
4 1234 02/03/2017
5 1234 02/03/2017 02/03/2017
6 1234 02/03/2017
7 12345 01/01/2017 01/01/2017
8 12345 01/01/2017
It does handle the case where a primary key has two dates: it simply picks the first date that appears.
Data:
df <- data.frame(Primary_Key = c(rep(123L, 3), rep(1234L, 3), rep(12345L, 2)),
Date = c("", "", "02/02/2017", "", "02/03/2017", "",
"01/01/2017", ""),
Date.impute = "",
stringsAsFactors = FALSE)
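If the loop turns out to be too slow on several hundred thousand rows, a grouped version with base R's `ave` may be worth trying. This is only a sketch: it assumes the same data layout as above (empty strings for missing dates), and `first_date` is a helper name made up for illustration.

```r
# Same sample data as in this answer, without the empty Date.impute column
df <- data.frame(
  Primary_Key = c(rep(123L, 3), rep(1234L, 3), rep(12345L, 2)),
  Date = c("", "", "02/02/2017", "", "02/03/2017", "", "01/01/2017", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-empty date in a group, repeated for every row of that group
# (NA if the group has no date at all)
first_date <- function(d) rep(d[d != ""][1], length(d))

# ave() applies the helper within each Primary_Key group and returns a vector in the original row order
df$Date.impute <- ave(df$Date, df$Primary_Key, FUN = first_date)
df
```

Like the loop, this picks the first date that appears when a key has more than one.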