I have two datasets in R.
The first dataset contains specific dates:
Value Date
# 20 2017-10-19
# 19 2017-10-23
# 19 2017-11-03
# 20 2017-11-10
The second contains stock index levels for the past 5 years:
Date Index
# 2017-11-10 13.206,35
# 2017-11-03 13.378,96
# 2017-10-25 13.404,58
# 2017-10-19 13.517,98
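Now I want to merge them by looking up each date from the first dataset, "DB", and adding the matching Index value for that date from the second dataset, "Hist".
What I did was use the left_join function: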
DB <- left_join(DB, Hist, by = "Date")
The problem is that some dates in the first dataset are public holidays, for which the second dataset, "Hist", has no data. So I get some NAs:
Value Date Index
# 20 2017-10-19 13.517,98
# 19 2017-10-23 NA
# 19 2017-11-03 13.378,96
# 20 2017-11-10 13.206,35
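What I am looking for is a way to take the value from the next available date instead of adding NA.
Example: instead of adding NA, use the Index from 2017-10-25 (2 days later):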
Value Date Index
# 20 2017-10-19 13.517,98
# 19 2017-10-23 13.404,58
# 19 2017-11-03 13.378,96
# 20 2017-11-10 13.206,35
Does anyone have an idea? Thanks in advance!
Answer 0 (score: 0)
A solution could be:
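library(dplyr)
library(rlang)

# Copy each missing Index from the row directly below it, then keep only the DB rows.
# Note: this assumes the row after every NA holds a non-NA Index.
clean_df <- function(df) {
  ix <- which(is.na(df$Index))
  df$Index[ix] <- df$Index[ix + 1]
  filter(df, !is.na(.data$Value))
}

full_join(DB, Hist, by = "Date") %>%
  arrange(Date) %>%
  clean_df()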
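The `ix + 1` step only works when the row right after each NA carries a value. With two consecutive missing dates, or a missing last row, it would copy another NA. A minimal base R sketch of a more tolerant fill follows; `fill_next` is a hypothetical helper name, not part of any package, and it assumes the vector is already sorted by date.

```r
# Hypothetical helper: replace each NA with the next non-NA value that
# follows it (next observation carried backward).
# Assumes x is already ordered by date, oldest first.
fill_next <- function(x) {
  nxt <- NA
  # Walk from the last element to the first, remembering the most
  # recent non-NA value seen so far.
  for (i in rev(seq_along(x))) {
    if (is.na(x[i])) x[i] <- nxt else nxt <- x[i]
  }
  x
}

# Two consecutive gaps and a trailing gap with nothing after it
idx <- c("13.517,98", NA, NA, "13.404,58", NA)
filled <- fill_next(idx)
print(filled)
```

The trailing NA stays NA because no later value exists to borrow from.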
Answer 1 (score: 0)
Here is one option. It uses full_join and then the fill function to impute the missing values.
library(tidyverse)
DB_final <- DB %>%
full_join(Hist, by = "Date") %>%
arrange(Date) %>%
fill(Index, .direction = "up") %>%
filter(!is.na(Value))
DB_final
# Value Date Index
# 1 20 2017-10-19 13.517,98
# 2 19 2017-10-23 13.404,58
# 3 19 2017-11-03 13.378,96
# 4 20 2017-11-10 13.206,35
However, the user needs to know the fill direction (up or down) in advance. If the user does not know it, this approach may not be useful.
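To make the effect of the direction concrete, here is a small sketch that runs both directions on the data from the question. With rows sorted by ascending date, "up" borrows from the next available date and "down" from the previous one. The names `joined`, `next_day` and `prev_day` are only illustrative.

```r
library(dplyr)
library(tidyr)

# Toy data copied from the question
DB <- data.frame(
  Value = c(20L, 19L, 19L, 20L),
  Date = as.Date(c("2017-10-19", "2017-10-23", "2017-11-03", "2017-11-10"))
)
Hist <- data.frame(
  Date = as.Date(c("2017-11-10", "2017-11-03", "2017-10-25", "2017-10-19")),
  Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98"),
  stringsAsFactors = FALSE
)

# Sort ascending by date so that "up" means "take the later date"
joined <- DB %>%
  full_join(Hist, by = "Date") %>%
  arrange(Date)

next_day <- joined %>% fill(Index, .direction = "up") %>% filter(!is.na(Value))
prev_day <- joined %>% fill(Index, .direction = "down") %>% filter(!is.na(Value))

next_day$Index[2]  # "13.404,58", taken from 2017-10-25
prev_day$Index[2]  # "13.517,98", taken from 2017-10-19
```

Since the question asks for the next available date, "up" after an ascending sort is the matching choice here.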
Here is another option, which I think is more robust. It uses the Index of the nearest date to impute the missing values.
# Collect all dates
Date_vec <- sort(unique(c(DB$Date, Hist$Date)))
# Create a distance matrix based on the dates, then convert it to a data frame
dt <- Date_vec %>%
dist() %>%
as.matrix() %>%
as.data.frame() %>%
rowid_to_column(var = "ID") %>%
gather(ID2, Value, -ID) %>%
mutate(ID2 = as.integer(ID2)) %>%
filter(ID != ID2) %>%
arrange(ID, Value) %>%
group_by(ID) %>%
slice(1) %>%
select(-Value)
dt$ID <- Date_vec[dt$ID]
dt$ID2 <- Date_vec[dt$ID2]
names(dt) <- c("Date1", "Date2")
dt
# # A tibble: 5 x 2
# # Groups: ID [5]
# Date1 Date2
# <date> <date>
# 1 2017-10-19 2017-10-23
# 2 2017-10-23 2017-10-25
# 3 2017-10-25 2017-10-23
# 4 2017-11-03 2017-11-10
# 5 2017-11-10 2017-11-03
dt shows the nearest date for every date.
Join DB with dt, then join Hist twice, each time on a different date column.
DB2 <- DB %>% left_join(dt, by = c("Date" = "Date1"))
DB3 <- DB2 %>%
left_join(Hist, by = "Date") %>%
left_join(Hist, by = c("Date2" = "Date"))
DB3
# Value Date Date2 Index.x Index.y
# 1 20 2017-10-19 2017-10-23 13.517,98 <NA>
# 2 19 2017-10-23 2017-10-25 <NA> 13.404,58
# 3 19 2017-11-03 2017-11-10 13.378,96 13.206,35
# 4 20 2017-11-10 2017-11-03 13.206,35 13.378,96
If Index.x has a value, use it; otherwise use the value in Index.y.
DB4 <- DB3 %>%
mutate(Index = ifelse(is.na(Index.x), Index.y, Index.x)) %>%
select(Value, Date, Index)
DB4
# Value Date Index
# 1 20 2017-10-19 13.517,98
# 2 19 2017-10-23 13.404,58
# 3 19 2017-11-03 13.378,96
# 4 20 2017-11-10 13.206,35
DB4 is the final output.
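One caveat: the distance matrix is built over the dates of both tables, so the nearest date of a missing date can itself be a date without an Index (2017-10-19 maps to 2017-10-23 above). A minimal base R sketch that only considers dates present in Hist is shown below. It is an alternative under that assumption, not the method used above, and like the nearest-date idea in general it may pick an earlier date rather than the next one.

```r
# Toy data copied from the question
DB <- data.frame(
  Value = c(20L, 19L, 19L, 20L),
  Date = as.Date(c("2017-10-19", "2017-10-23", "2017-11-03", "2017-11-10"))
)
Hist <- data.frame(
  Date = as.Date(c("2017-11-10", "2017-11-03", "2017-10-25", "2017-10-19")),
  Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98"),
  stringsAsFactors = FALSE
)

# For each DB date, find the row of Hist whose date is closest in days.
# which.min() returns the first match, so ties go to the earlier row of Hist.
nearest <- vapply(seq_along(DB$Date), function(i) {
  which.min(abs(as.numeric(Hist$Date - DB$Date[i])))
}, integer(1))

DB$Index <- Hist$Index[nearest]
DB
```

An exact match has distance zero, so dates that exist in Hist keep their own Index.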
Data
DB <- structure(list(Value = c(20L, 19L, 19L, 20L), Date = structure(c(17458,
17462, 17473, 17480), class = "Date")), class = "data.frame", .Names = c("Value",
"Date"), row.names = c(NA, -4L))
Hist <- structure(list(Date = structure(c(17480, 17473, 17464, 17458), class = "Date"),
Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98"
)), class = "data.frame", .Names = c("Date", "Index"), row.names = c(NA,
-4L))
Answer 2 (score: 0)
This is what you did, plus as.Date() to convert the date columns:
library(data.table)
library(dplyr)
DB = data.table(
Value = c(20,19,19,29),
Date = c("2017-10-19","2017-10-23","2017-11-03","2017-11-10")
)
Hist = data.table(
Date = c("2017-11-10","2017-11-03","2017-10-25","2017-10-19"),
Index = c("13.206,35","13.378,96","13.404,58","13.517,98")
)
DB[, Date := as.Date(Date)]
Hist[, Date := as.Date(Date)]
DB <- left_join(DB,Hist,by="Date") %>% as.data.table()
Now perform the following steps:
# Split off the rows that are missing an Index.
DB_na <- DB[is.na(Index), ]
DB <- DB[!is.na(Index), ]
# Given a date with no Index, return the Index of the closest later date.
# Note: this searches the joined DB, not Hist, so only dates present in DB are candidates.
get_na_index <- function(na_date) {
  bigger_dates <- DB[Date > na_date, ]
  index <- bigger_dates[which.min(Date - na_date), Index]
  return(index)
}
# Use apply() to perform the lookup row by row.
DB_na$Index <- apply(matrix(DB_na$Date), 1, get_na_index)
# Combine the rows again and sort by date.
DB <- rbind(DB, DB_na) %>% arrange(Date)
Output:
DB
Value Date Index
1 20 2017-10-19 13.517,98
2 19 2017-10-23 13.378,96
3 19 2017-11-03 13.378,96
4 29 2017-11-10 13.206,35
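In the output above, 2017-10-23 receives the value from 2017-11-03, because the lookup only sees dates that survived the join into DB. The question asks for the value from 2017-10-25 instead. One way to get that with data.table is a rolling join against Hist, sketched below on the data from the question. `roll = -Inf` carries the next observation backward, which matches "next available date"; `res` is just an illustrative name.

```r
library(data.table)

# Toy data copied from the question
DB <- data.table(
  Value = c(20, 19, 19, 20),
  Date = as.Date(c("2017-10-19", "2017-10-23", "2017-11-03", "2017-11-10"))
)
Hist <- data.table(
  Date = as.Date(c("2017-11-10", "2017-11-03", "2017-10-25", "2017-10-19")),
  Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98")
)

# For every row of DB, look up Hist by Date. When there is no exact match,
# roll = -Inf takes the Index of the next later date in Hist.
res <- Hist[DB, on = "Date", roll = -Inf]
setcolorder(res, c("Value", "Date", "Index"))
res
```

Dates later than the last entry in Hist would still come back as NA, since there is no later observation to roll back from.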