I have two datasets containing the following data.
maindata = data.frame(eventid = c(1:10),
                      district = c(rep("lucknow", 2), rep("allahabad", 1), rep("kanpur", 2)),
                      date = c(rep("2018-01-01", 2), rep("2018-01-02", 1), rep("2018-01-03", 2)))
weather = data.frame(district = c(rep("lucknow", 4), rep("allahabad", 3), rep("kanpur", 3)),
                     date = c(rep("2017-01-01", 4), rep("2017-01-02", 3), rep("2017-01-03", 3)),
                     temperature = c(rep("19.3", 2), rep("22.1", 1), rep("24.1", 2)))
A few considerations:
What I have tried (with some silly conversions that I will fix):
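library(dplyr)
# Normalise the join keys; column names follow the sample data above
weather$district <- tolower(as.character(weather$district))
weather$date <- as.Date(as.character(weather$date), format = "%Y-%m-%d")
# Month-day keys, so that 2018 events can be matched to 2017 weather
maindata$md <- strftime(as.Date(as.character(maindata$date)), "%m-%d")
weather$mdr <- strftime(weather$date, "%m-%d")
maindata <- left_join(maindata, weather, by = c("md" = "mdr", "district" = "district"))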
The final expected result is something like the following maindata:
eventid  district   date        temperature
1        lucknow    2018-01-01  19.3
2        lucknow    2018-01-01  19.3
3        allahabad  2018-01-03  24.1
4        kanpur     2018-01-03  NA
5        kanpur     2018-01-02  22.1
6        lucknow    2018-01-01  19.3
7        lucknow    2018-01-01  19.3
8        allahabad  2018-01-03  24.1
9        kanpur     2018-01-03  NA
10       kanpur     2018-01-02  22.1
Can anyone help?
Answer 0 (score: 1)
Maybe something like this (using the updated data):
library(tidyverse)
df1 %>%
  mutate(date = as.POSIXct(date),
         date1 = format(date, "%d/%m")) %>%
  left_join(df2 %>%
              mutate(date = as.POSIXct(date),
                     date1 = format(date, "%d/%m")),
            by = c("date1" = "date1", "district" = "dist")) %>%
  select(-date1, -date.y) %>%
  rename(date = date.x) %>%
  filter(!duplicated(eventid))
#output
eventid date district temp
1 1 2017-01-01 dist-1 19.3
2 2 2017-01-01 dist-1 19.3
3 3 2017-01-01 dist-1 19.3
4 4 2017-01-01 dist-1 19.3
5 5 2017-01-02 dist-2 <NA>
6 6 2017-01-02 dist-2 <NA>
7 7 2017-01-02 dist-2 <NA>
8 8 2017-01-03 dist-3 24.10
9 9 2017-01-03 dist-3 24.10
10 10 2017-01-03 dist-3 24.10
Convert the dates in both data frames to POSIXct, create a %d/%m column, join on that column and the district, then clean up.
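Answer 1 (score: 1)
I don't understand the logic behind your merge; specifically, I don't see how date comes into play.
Your expected output can be reproduced without taking date into account at all, simply by matching df1$district with df2$dist: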
library(tidyverse)
left_join(df1, df2, by = c("district" = "dist")) %>%
  distinct() %>%
  select(-date.y)
# eventid date.x district temp
#1 1 2017-01-01 dist-1 19.3
#2 2 2017-01-01 dist-1 19.3
#3 3 2017-01-01 dist-1 19.3
#4 4 2017-01-01 dist-1 19.3
#5 5 2017-01-02 dist-2 22.1
#6 6 2017-01-02 dist-2 22.1
#7 7 2017-01-02 dist-2 22.1
#8 8 2017-01-03 dist-3 24.10
#9 9 2017-01-03 dist-3 24.10
#10 10 2017-01-03 dist-3 24.10
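Could you provide sample data that is more representative of what you are trying to do, and in which the role of date in the merge becomes clear?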
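Answer 2 (score: 1)
A quick note: before asking for help on SO, you should post your own attempts at a solution.
Answer:
What you should use is the merge function, which is available in R by default.
After reproducing the data frames you provided, try the code block below: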
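# Since dates don't matter, df2 can be reduced to a new data frame with only dist and temp
df3 <- df2[, c("dist", "temp")]
df3 <- unique(df3)
df4 <- merge(df1, df3, by.x = "district", by.y = "dist", all.x = TRUE)
The deduplication is done to avoid creating a large number of rows, one for every combination of dates in df1 and df2.
all.x = TRUE ensures that you get a left join (all rows of df1 appear in the final output).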
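To make the row-multiplication point concrete, here is a small sketch with made-up data. The contents of `df1` and `df2` below are hypothetical (the updated data from the question is not shown in this thread); only the column names `district`, `dist` and `temp` follow the answer above.

```r
# Hypothetical events: two in dist-1, one in dist-2, one in dist-3
df1 <- data.frame(eventid  = 1:4,
                  district = c("dist-1", "dist-1", "dist-2", "dist-3"),
                  stringsAsFactors = FALSE)

# Hypothetical weather: dist-1 has two dates with the same temp, dist-3 has none
df2 <- data.frame(dist = c("dist-1", "dist-1", "dist-2"),
                  date = c("2017-01-01", "2017-01-02", "2017-01-01"),
                  temp = c(19.3, 19.3, 22.1),
                  stringsAsFactors = FALSE)

# Merging directly: every dist-1 event is paired with both dist-1 weather rows
naive <- merge(df1, df2, by.x = "district", by.y = "dist", all.x = TRUE)

# Deduplicating dist/temp first keeps one output row per event
df3 <- unique(df2[, c("dist", "temp")])
deduped <- merge(df1, df3, by.x = "district", by.y = "dist", all.x = TRUE)

print(nrow(naive))    # more rows than events
print(nrow(deduped))  # one row per event
```

Note that this only collapses to one row per event when each district has a single distinct temperature; if a district had several, the deduplicated table would still hold several rows for it.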
Answer 3 (score: 1)
Maybe this is what you want.
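# Convert the second column of df2 (temp) to numeric via character
df2[, 2] <- as.numeric(as.character(df2[, 2]))
# Left join on district; [-5] drops the fifth column of the merged result
m1 <- merge(df1, df2, by.x = "district", by.y = "dist", all.x = TRUE)[-5]
names(m1)[3] <- "date"
# Reorder the columns and drop duplicated rows
m1 <- unique(m1[, c(2, 3, 1, 4)])
rownames(m1) <- NULL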
> m1
eventid date district temp
1 1 2017-01-01 dist-1 19.3
2 2 2017-01-01 dist-1 19.3
3 3 2017-01-01 dist-1 19.3
4 4 2017-01-01 dist-1 19.3
5 5 2017-01-02 dist-2 22.1
6 6 2017-01-02 dist-2 22.1
7 7 2017-01-02 dist-2 22.1
8 8 2017-01-03 dist-3 24.1
9 9 2017-01-03 dist-3 24.1
10 10 2017-01-03 dist-3 24.1
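The `as.numeric(as.character(...))` step above matters when the column is a factor, which is presumably the case for temp here (an assumption, since the updated data is not shown). A minimal sketch with a made-up factor `f`:

```r
# A factor holding temperatures as text (hypothetical values)
f <- factor(c("19.3", "22.1", "24.10"))

# Converting the factor directly returns the internal level codes
codes <- as.numeric(f)

# Going through character first returns the actual numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

Converting directly would silently replace the temperatures with small integers, so the detour through `as.character()` is the safer habit whenever a column might be a factor.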