Merge/join data frames/tables based on a criterion (> or <)

Time: 2016-04-08 14:25:56

Tags: r merge dataframe data.table

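I have a data frame containing weekly data. Each section has about 104 weeks of data, and there are 83 sections in total.

I have a second data frame containing the start and end week for each section, which I want to use to filter the main data frame.

In both tables, the week is a combination of year and week number, e.g. 201501, and the week number always runs from 1 to 52.

So in the example below, I want to keep section A for weeks 201401 through 201404 and section B for weeks 201551 through 201603.

My first idea was to add an extra column to the Weeks_Filter data frame holding the sequence of weeks from each section's start to its end (one row per week), then merge the two tables and keep all rows from Weeks_Filter (all.y = TRUE). That worked on a small sample I tried, but I don't know how to generate the consecutive weeks, because they can span different years.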

Week <- c("201401","201402","201403","201404","201405", "201451", "201552", "201601", "201602", "201603")
Section <- c(rep("A",5),rep("B",5))
df <- data.frame(cbind(Week, Section))

Section <- c("A", "B")
Start <- c("201401","201551")
End <- c("201404","201603")
Weeks_Filter <- data.frame(cbind(Section, Start, End))

4 answers:

Answer 0 (score: 4)

The latest development version of data.table adds non-equi joins (in older versions, you can use foverlaps):

library(data.table)

setDT(df) # convert to data.table in place
setDT(Weeks_Filter)

# fix the column types - you have factors currently, converting to integer
df[, Week := as.integer(as.character(Week))]
Weeks_Filter[, `:=`(Start = as.integer(as.character(Start)),
                    End   = as.integer(as.character(End)))]

# the actual magic: the inner join returns the matching row numbers of df
df[df[Weeks_Filter, on = .(Section, Week >= Start, Week <= End), which = TRUE]]
#     Week Section
#1: 201401       A
#2: 201402       A
#3: 201403       A
#4: 201404       A
#5: 201552       B
#6: 201601       B
#7: 201602       B
#8: 201603       B
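
For data.table versions without non-equi joins, the foverlaps route mentioned above could look roughly like the following. This is only a sketch and not part of the original answer: the sample data is rebuilt here with integer weeks under the names weeks and filt, and WeekEnd is a hypothetical helper column that turns each week into a zero-length interval, since foverlaps expects a start and an end column on both sides.

```r
library(data.table)

# Sample data from the question, rebuilt with integer weeks
weeks <- data.table(
  Week    = c(201401L, 201402L, 201403L, 201404L, 201405L,
              201451L, 201552L, 201601L, 201602L, 201603L),
  Section = rep(c("A", "B"), each = 5)
)
filt <- data.table(Section = c("A", "B"),
                   Start   = c(201401L, 201551L),
                   End     = c(201404L, 201603L))

# Hypothetical helper column: each week becomes the interval [Week, Week]
weeks[, WeekEnd := Week]

# foverlaps needs y to be keyed, with the interval columns last in the key
setkey(filt, Section, Start, End)

# Keep only weeks that fall within their section's Start/End interval
res <- foverlaps(weeks, filt,
                 by.x = c("Section", "Week", "WeekEnd"),
                 type = "within", nomatch = 0L)
res[, WeekEnd := NULL]
res
```

The result also carries the Start and End columns from filt, which can be dropped if only the filtered weeks are needed.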

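Answer 1 (score: 1)

With dplyr you can:

  • Merge your data frames
  • Group by section
  • Filter based on the Start and End columns

One problem is that your weeks are characters, and they become factors the way you coded them. I took a shortcut and made them numeric, but I would suggest using lubridate to turn them into proper Date-class vectors.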

library(dplyr)

# attach each section's Start/End to its weekly rows
tempdf <- full_join(df, Weeks_Filter, by = "Section")

# factor -> numeric, going through character to keep the actual values
tempdf$Week <- as.numeric(as.character(tempdf$Week))
tempdf$Start <- as.numeric(as.character(tempdf$Start))
tempdf$End <- as.numeric(as.character(tempdf$End))

tempdf_filt <- tempdf %>%
  group_by(Section) %>%
  filter(Week >= Start,
         Week <= End)

There seems to be an issue in your data where "201451" should be "201551", but otherwise this returns what you want:

> tempdf_filt
Source: local data frame [8 x 4]
Groups: Section [2]

    Week Section  Start    End
   (dbl)  (fctr)  (dbl)  (dbl)
1 201401       A 201401 201404
2 201402       A 201401 201404
3 201403       A 201401 201404
4 201404       A 201401 201404
5 201552       B 201551 201603
6 201601       B 201551 201603
7 201602       B 201551 201603
8 201603       B 201551 201603
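
As a side note, the join and the filter can be combined into one step with an inequality join. The sketch below assumes dplyr 1.1.0 or later, which introduced join_by() and postdates this answer; the names weeks, filt and kept and the numeric sample data are stand-ins rebuilt from the question rather than the original objects.

```r
library(dplyr)

# Sample data from the question, rebuilt with numeric weeks
weeks <- data.frame(
  Week    = c(201401, 201402, 201403, 201404, 201405,
              201451, 201552, 201601, 201602, 201603),
  Section = rep(c("A", "B"), each = 5)
)
filt <- data.frame(Section = c("A", "B"),
                   Start   = c(201401, 201551),
                   End     = c(201404, 201603))

# Match on Section and keep only weeks inside [Start, End]
kept <- inner_join(weeks, filt,
                   by = join_by(Section, Week >= Start, Week <= End))
kept
```

Because the range condition is part of the join, rows outside a section's interval are never materialised, which may matter with 83 sections of roughly 104 weeks each.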

Answer 2 (score: 0)

Creating a vector of all the desired weeks may work as a filter. Here is a rough example using base R:

# get two-digit week numbers, "01" to "52"
weekNums <- sprintf("%02d", 1:52)
# get all year-weeks for 2014 to 2016
allWeeks <- paste0(rep(2014:2016, each = 52), weekNums)

# filter vector to select desired weeks
keepWeeks <- allWeeks[grep("^201(40[1-4]|55[12]|60[123])$", allWeeks)]

dfKeeper <- df[df$Week %in% keepWeeks,]

I tried to build a regular expression that captures the desired periods, but you may need to tweak it a bit.
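
An alternative to a hand-written regular expression is to generate each section's weeks from its start and end, which is the idea the question started from. The sketch below relies on the question's statement that every year has exactly 52 weeks; to_index, from_index and week_seq are hypothetical helper names, and weeks and filt are stand-ins for the question's data.

```r
# Map a year-week such as 201552 to a running week count,
# assuming every year has exactly 52 weeks
to_index <- function(yw) {
  yw <- as.integer(as.character(yw))
  (yw %/% 100L) * 52L + (yw %% 100L) - 1L
}

# Inverse: running week count back to a year-week integer
from_index <- function(i) (i %/% 52L) * 100L + (i %% 52L) + 1L

# All year-weeks from start to end, inclusive, across year boundaries
week_seq <- function(start, end) {
  from_index(seq(to_index(start), to_index(end)))
}

week_seq(201551, 201603)
# 201551 201552 201601 201602 201603

# Expand the filter table to one row per section and week
filt <- data.frame(Section = c("A", "B"),
                   Start   = c("201401", "201551"),
                   End     = c("201404", "201603"),
                   stringsAsFactors = FALSE)
expanded <- do.call(rbind, lapply(seq_len(nrow(filt)), function(i) {
  data.frame(Section = filt$Section[i],
             Week    = week_seq(filt$Start[i], filt$End[i]),
             stringsAsFactors = FALSE)
}))

# Keep only the weekly rows that appear in the expanded filter
weeks <- data.frame(
  Week    = c(201401L, 201402L, 201403L, 201404L, 201405L,
              201451L, 201552L, 201601L, 201602L, 201603L),
  Section = rep(c("A", "B"), each = 5),
  stringsAsFactors = FALSE
)
kept <- merge(weeks, expanded, by = c("Section", "Week"))
```

Unlike a single regular expression over all weeks, this keeps the filter specific to each section.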

Answer 3 (score: -2)

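library(data.table)

# attach each section's Start/End to its weekly rows
df <- merge(df, Weeks_Filter)
# convert every column except Section from factor to numeric
df[, -1] <- apply(df[, -1], 2, function(x) as.numeric(as.character(x)))
df <- data.table(df)

# keep the weeks that fall inside each section's range
df[Week >= Start & Week <= End, .SD, by = Section]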

The output is:

   Section  Start    End   Week
1:       A 201401 201404 201401
2:       A 201401 201404 201402
3:       A 201401 201404 201403
4:       A 201401 201404 201404
5:       B 201551 201603 201552
6:       B 201551 201603 201601
7:       B 201551 201603 201602
8:       B 201551 201603 201603