Question

I have 2 datasets in R.

The first dataset contains specific dates:

    Value       Date   
#   20          2017-10-19 
#   19          2017-10-23 
#   19          2017-11-03 
#   20          2017-11-10

The second contains the level of a stock index over the past 5 years:

     Date       Index
#    2017-11-10 13.206,35
#    2017-11-03 13.378,96
#    2017-10-25 13.404,58
#    2017-10-19 13.517,98

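Now I want to merge them by looking up each date from the first dataset, "DB", and adding the matching Index value for that date from the second dataset, "Hist".

What I did was use the left_join function: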

DB <- left_join(DB, Hist, by = "Date")

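The problem is that some dates in the first dataset are public holidays, for which no data is available in the second dataset, "Hist". So I end up with some NAs: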

  Value   Date         Index
# 20      2017-10-19   13.517,98
# 19      2017-10-23   NA
# 19      2017-11-03   13.378,96
# 20      2017-11-10   13.206,35

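What I am looking for is to take the value from the next available date instead of adding NA.

Example: instead of adding NA, use the Index from 2017-10-25 (2 days later):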

  Value   Date         Index
# 20      2017-10-19   13.517,98
# 19      2017-10-23   13.404,58
# 19      2017-11-03   13.378,96
# 20      2017-11-10   13.206,35

Does anyone have an idea? Thanks in advance!

Answer 1

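One possible solution:

library(dplyr)
library(rlang)

clean_df <- function(df) {

  # Rows that came only from DB have no Index after the full join
  ix <- which(is.na(df$Index))
  # Take the Index from the following row, i.e. the next date
  # (assumes that row has an Index)
  df$Index[ix] <- df$Index[ix + 1]

  # Drop the rows that came only from Hist
  filter(df, !is.na(.data$Value))

}

full_join(DB, Hist, by = "Date") %>%
  arrange(Date) %>%
  clean_df()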
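Note that `df$Index[ix + 1]` only looks one row ahead, so if two dates without an Index sit on consecutive rows, the first of them stays NA. A minimal base R sketch of a helper that carries the next non-missing value backward over any run of NAs could look like this (the name `fill_from_next` is made up for illustration):

```r
# Hypothetical helper: replace each NA with the next non-NA value after it.
fill_from_next <- function(x) {
  if (length(x) < 2) return(x)
  # Walk from the end so each NA can take the already-filled value after it.
  for (i in rev(seq_len(length(x) - 1))) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

index <- c("13.517,98", NA, NA, "13.404,58", NA)
filled <- fill_from_next(index)
filled
# Both NAs in the middle take "13.404,58"; the trailing NA has no later
# value to borrow from and stays NA.
```

Inside `clean_df()`, `df$Index <- fill_from_next(df$Index)` would then stand in for the two `ix` lines.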

Answer 2

Original request

Here is one option. It uses full_join, then the fill function to impute the missing values.

library(tidyverse)

DB_final <- DB %>%
  full_join(Hist, by = "Date") %>%
  arrange(Date) %>%
  fill(Index, .direction = "up") %>%
  filter(!is.na(Value))
DB_final
#   Value       Date     Index
# 1    20 2017-10-19 13.517,98
# 2    19 2017-10-23 13.404,58
# 3    19 2017-11-03 13.378,96
# 4    20 2017-11-10 13.206,35

However, the user needs to know the fill direction (up or down) in advance. If the direction is not known, this may not be useful.
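To make the direction concrete: with the rows sorted by ascending date, "up" copies the value from the next (later) row and "down" copies it from the previous (earlier) row. The small sketch below uses a made-up three-row data frame `toy` to show both:

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative only): the middle date has no Index.
toy <- data.frame(
  Date  = as.Date(c("2017-10-19", "2017-10-23", "2017-10-25")),
  Index = c("13.517,98", NA, "13.404,58")
)

# "up" fills from the following row, "down" from the preceding row.
next_day <- toy %>% arrange(Date) %>% fill(Index, .direction = "up")
prev_day <- toy %>% arrange(Date) %>% fill(Index, .direction = "down")

next_day$Index[2]  # value taken from 2017-10-25
prev_day$Index[2]  # value taken from 2017-10-19
```

Since the question asks for the next available date, sorting ascending and filling "up" is the matching choice here.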

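Impute missing values based on the nearest date

Here is another option, which I think is more robust. It imputes each missing value with the Index from the nearest date.

Step 1: Find the nearest date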

# Collect all dates
Date_vec <- sort(unique(c(DB$Date, Hist$Date)))

# Create a distance matrix based on dates, then convert it to a data frame
dt <- Date_vec %>%
  dist() %>%
  as.matrix() %>%
  as.data.frame() %>%
  rowid_to_column(var = "ID") %>%
  gather(ID2, Value, -ID) %>%
  mutate(ID2 = as.integer(ID2)) %>%
  filter(ID != ID2) %>%
  arrange(ID, Value) %>%
  group_by(ID) %>%
  slice(1) %>%
  select(-Value)

dt$ID <- Date_vec[dt$ID]
dt$ID2 <- Date_vec[dt$ID2]  

names(dt) <- c("Date1", "Date2")

dt
# # A tibble: 5 x 2
# # Groups:   ID [5]
#       Date1      Date2
#      <date>     <date>
# 1 2017-10-19 2017-10-23
# 2 2017-10-23 2017-10-25
# 3 2017-10-25 2017-10-23
# 4 2017-11-03 2017-11-10
# 5 2017-11-10 2017-11-03

dt shows the nearest date for every date.
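One thing to keep in mind is that the nearest date is picked from all dates combined, so it can be a date that has no Index itself (2017-10-19 is paired with 2017-10-23, which is not in Hist). That does not matter in this example, but as a rough alternative the search can be restricted to the dates that actually have data. The helper name `nearest_date` below is hypothetical, and unlike dt it returns the date itself when an exact match exists:

```r
DB_dates   <- as.Date(c("2017-10-19", "2017-10-23", "2017-11-03", "2017-11-10"))
Hist_dates <- as.Date(c("2017-11-10", "2017-11-03", "2017-10-25", "2017-10-19"))

# Hypothetical helper: for each date in x, return the closest date in candidates.
# Ties are resolved in favour of whichever candidate comes first.
nearest_date <- function(x, candidates) {
  pos <- vapply(
    seq_along(x),
    function(i) which.min(abs(as.numeric(candidates) - as.numeric(x[i]))),
    integer(1)
  )
  candidates[pos]
}

nearest <- nearest_date(DB_dates, Hist_dates)
nearest
# 2017-10-23 is matched to 2017-10-25; the other three dates match themselves.
```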

Step 2: Perform multiple joins

Join DB and dt, then join Hist twice, each time on a different date column.

DB2 <- DB %>% left_join(dt, by = c("Date" = "Date1")) 

DB3 <- DB2 %>%
  left_join(Hist, by = "Date") %>%
  left_join(Hist, by = c("Date2" = "Date")) 
DB3
#   Value       Date      Date2   Index.x   Index.y
# 1    20 2017-10-19 2017-10-23 13.517,98      <NA>
# 2    19 2017-10-23 2017-10-25      <NA> 13.404,58
# 3    19 2017-11-03 2017-11-10 13.378,96 13.206,35
# 4    20 2017-11-10 2017-11-03 13.206,35 13.378,96

Step 3: Finalize the Index

If there is a value in Index.x, use it; otherwise use the value in Index.y.

DB4 <- DB3 %>% 
  mutate(Index = ifelse(is.na(Index.x), Index.y, Index.x)) %>%
  select(Value, Date, Index)
DB4
#   Value       Date     Index
# 1    20 2017-10-19 13.517,98
# 2    19 2017-10-23 13.404,58
# 3    19 2017-11-03 13.378,96
# 4    20 2017-11-10 13.206,35

DB4 is the final output.
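As a side note, the "first non-missing value" rule from Step 3 can also be written with dplyr's coalesce(), which returns, element by element, the first value that is not NA. A small sketch on stand-alone vectors (the values are copied from the example, the vector names just mirror the columns):

```r
library(dplyr)

Index.x <- c("13.517,98", NA, "13.378,96")
Index.y <- c(NA, "13.404,58", "13.206,35")

# Take Index.x where available, otherwise fall back to Index.y.
merged <- coalesce(Index.x, Index.y)
merged
```

In the pipeline above this would read `mutate(Index = coalesce(Index.x, Index.y))`.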

Data

DB <- structure(list(Value = c(20L, 19L, 19L, 20L),
                     Date = structure(c(17458, 17462, 17473, 17480), class = "Date")),
                class = "data.frame", .Names = c("Value", "Date"),
                row.names = c(NA, -4L))

Hist <- structure(list(Date = structure(c(17480, 17473, 17464, 17458), class = "Date"),
                       Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98")),
                  class = "data.frame", .Names = c("Date", "Index"),
                  row.names = c(NA, -4L))

Answer 3

What you did, plus as.Date() to convert the dates:

library(data.table)
library(dplyr)

DB = data.table(
  Value = c(20,19,19,20),
  Date = c("2017-10-19","2017-10-23","2017-11-03","2017-11-10")
  )

Hist = data.table(
  Date = c("2017-11-10","2017-11-03","2017-10-25","2017-10-19"),
  Index = c("13.206,35","13.378,96","13.404,58","13.517,98")
  )

DB[, Date := as.Date(Date)]
Hist[, Date := as.Date(Date)]

DB <- left_join(DB,Hist,by="Date") %>% as.data.table()

Now carry out the following steps:

# Get rows which are missing an Index.
DB_na <- DB[is.na(Index),]
DB <- DB[!is.na(Index),]

# Build function to find appropriate Index, given an na_date.
# It looks in Hist for the closest date after na_date.
get_na_index <- function(na_date) {
  later_dates = Hist[Date > na_date,]
  index = later_dates[which.min(as.numeric(Date) - as.numeric(na_date)), Index]
  return(index)
}

# Use sapply() to apply the function to each missing date.
DB_na$Index <- sapply(DB_na$Date, get_na_index)

# Combine rows
DB <- rbind(DB, DB_na) %>% arrange(Date)

Output:

DB

  Value       Date     Index
1    20 2017-10-19 13.517,98
2    19 2017-10-23 13.404,58
3    19 2017-11-03 13.378,96
4    20 2017-11-10 13.206,35
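Since data.table is already loaded here, the same lookup can probably be done in one step with a rolling join. In the sketch below, `roll = -Inf` asks data.table to carry the next observation backward whenever a date in DB has no exact match in Hist; the data is rebuilt so the block runs on its own, and `res` is just an illustrative name:

```r
library(data.table)

DB <- data.table(
  Value = c(20, 19, 19, 20),
  Date  = as.Date(c("2017-10-19", "2017-10-23", "2017-11-03", "2017-11-10"))
)
Hist <- data.table(
  Date  = as.Date(c("2017-11-10", "2017-11-03", "2017-10-25", "2017-10-19")),
  Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98")
)

# For every row of DB, look up its Date in Hist. Where there is no exact
# match, roll = -Inf uses the Index of the next later date in Hist.
res <- Hist[DB, on = "Date", roll = -Inf]
res
```

The result has one row per row of DB, so 2017-10-23 picks up the Index from 2017-10-25 without a separate NA-handling step.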

left_join (dplyr) with the next available date
