Question

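I have several hundred thousand rows, most of which are missing a value (column 2). Based on the primary key (column 1), I can assume that a missing value can be imputed from the value associated with that key. An example is in order.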

Primary Key Date       Date.Impute
123         ""  
123         ""  
123         02/02/2017  
1234        ""  
1234        02/03/2017  
1234        ""  
12345       01/01/2017  
12345       ""

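All dates for order "123" are "02/02/2017". All dates for order "1234" are "02/03/2017", and so on.

In R, with or without an index-match style function, how can I fill column 3 with all the missing fields of column 2? The final result would look like this: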

Primary Key Date          Date.Impute
123         ""            02/02/2017
123         ""            02/02/2017
123         02/02/2017    02/02/2017    
1234        ""            02/03/2017
1234        02/03/2017    02/03/2017
1234        ""            02/03/2017
12345       01/01/2017    01/01/2017
12345       ""            01/01/2017

I know how to do this in Excel and am happy to share it, but I would like to learn how to do it in R. Any help would be greatly appreciated. Thanks.

Answer 1

In base R, you can simply do

# assumes df holds only the Primary.Key and Date columns
merge(df, unique(df[df$Date != "", ]), by = "Primary.Key", all.x = TRUE)

#  Primary.Key     Date.x     Date.y
#1         123            02/02/2017
#2         123            02/02/2017
#3         123 02/02/2017 02/02/2017
#4        1234            02/03/2017
#5        1234 02/03/2017 02/03/2017
#6        1234            02/03/2017
#7       12345 01/01/2017 01/01/2017
#8       12345            01/01/2017
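
For a self-contained run, the sketch below rebuilds the question's first two columns and renames the merged columns to the layout the question asks for. The construction of `df` and the names `lookup` and `res` are assumptions, since the answer does not show its input.

```r
# Assumed input: the question's first two columns, with "" marking a missing date
df <- data.frame(
  Primary.Key = c(123, 123, 123, 1234, 1234, 1234, 12345, 12345),
  Date = c("", "", "02/02/2017", "", "02/03/2017", "", "01/01/2017", ""),
  stringsAsFactors = FALSE
)

# One row per distinct (key, date) pair among the rows that have a date
lookup <- unique(df[df$Date != "", ])

# Left join: every row of df picks up the date known for its key
res <- merge(df, lookup, by = "Primary.Key", all.x = TRUE)

# merge() suffixes the clashing Date columns with .x and .y; rename them
names(res) <- c("Primary.Key", "Date", "Date.Impute")
res
```

Two caveats with this approach: a key with no date at all ends up with NA in Date.Impute, and a key with two different dates gets its rows duplicated, once per date.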

Answer 2

Reproducible data

I added an extra row with Primary.Key == 123456 that has no Date value.

library(lubridate)
# mdy() parses month/day/year strings; NA stays NA
df <- data.frame(Primary.Key = c(123,123,123,1234,1234,1234,12345,12345,123456),
         Date=mdy(c(NA,NA,"02/02/2017",NA,"02/03/2017",NA,"01/01/2017",NA,NA)),
         Date.Impute=as.Date(rep(NA,9)), stringsAsFactors=FALSE)

dplyr and purrr solution

Use ifelse to handle entries such as Primary.Key == 123456 that have no Date values. I also changed from using unique to tail(sort(), 1).

library(dplyr)
library(purrr)
L <- split(df, df$Primary.Key)           # split by Primary.Key groups into list
df1 <- map_df(L, ~.x %>% mutate(Date.Impute = ifelse(length(tail(sort(Date),1))==0, as.character(NA), as.character(tail(sort(Date),1)))))
df2 <- df1 %>% mutate(Date.Impute = ymd(Date.Impute))

Output

  Primary.Key       Date Date.Impute
1         123       <NA>  2017-02-02
2         123       <NA>  2017-02-02
3         123 2017-02-02  2017-02-02
4        1234       <NA>  2017-02-03
5        1234 2017-02-03  2017-02-03
6        1234       <NA>  2017-02-03
7       12345 2017-01-01  2017-01-01
8       12345       <NA>  2017-01-01
9      123456       <NA>        <NA>
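
As a possible alternative to the split/map_df round trip, the same result can be sketched with a grouped mutate in dplyr alone. This is a swapped-in variant, not the code from the answer above; the data is rebuilt with as.Date instead of lubridate, and the name `res` is assumed.

```r
library(dplyr)

# Same rows as the reproducible data above, built with base as.Date
df <- data.frame(
  Primary.Key = c(123, 123, 123, 1234, 1234, 1234, 12345, 12345, 123456),
  Date = as.Date(c(NA, NA, "2017-02-02", NA, "2017-02-03", NA,
                   "2017-01-01", NA, NA))
)

res <- df %>%
  group_by(Primary.Key) %>%
  # sort() drops NA values; indexing an empty result with [1] yields NA,
  # so a key without any date keeps NA instead of raising an error
  mutate(Date.Impute = sort(Date, decreasing = TRUE)[1]) %>%
  ungroup()

res
```

Because the column stays a Date throughout, no conversion to character and back is needed.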

Answer 3

This may be a bit slow... but it at least gets the job done:

for (key in unique(df$Primary_Key)) {
  keyrows <- df$Primary_Key == key
  key_d <- df[keyrows & df$Date != "", "Date"][1]
  df[keyrows, "Date.impute"] <- key_d
}

df

  Primary_Key       Date Date.impute
1         123             02/02/2017
2         123             02/02/2017
3         123 02/02/2017  02/02/2017
4        1234             02/03/2017
5        1234 02/03/2017  02/03/2017
6        1234             02/03/2017
7       12345 01/01/2017  01/01/2017
8       12345             01/01/2017

It does handle the case where one primary key has two dates, by simply picking the first date that appears.

Data:

df <- data.frame(Primary_Key = c(rep(123L, 3), rep(1234L, 3), rep(12345L, 2)), 
                 Date = c("", "", "02/02/2017", "", "02/03/2017", "", 
                          "01/01/2017", ""), 
                 Date.impute = "",
                 stringsAsFactors = FALSE)
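
If the loop turns out to be too slow on several hundred thousand rows, a vectorized lookup with match() is one option. The sketch below is an assumption-laden variant rather than part of the original answer; `known` is a hypothetical helper name.

```r
df <- data.frame(
  Primary_Key = c(rep(123L, 3), rep(1234L, 3), rep(12345L, 2)),
  Date = c("", "", "02/02/2017", "", "02/03/2017", "", "01/01/2017", ""),
  stringsAsFactors = FALSE
)

# Rows that actually carry a date
known <- df[df$Date != "", ]

# match() gives, for each key in df, the position of its first occurrence
# in known, so a key with several dates gets the first one that appears
df$Date.impute <- known$Date[match(df$Primary_Key, known$Primary_Key)]
df
```

Like the loop, this keeps the first date seen for each key; a key with no date at all receives NA.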

Index non-empty values for unique values in R
