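Match data between multiple columns of 2 data frames to return a "matched" value or a mean, based on one or two matching columns

Time: 2013-12-05 03:30:11

Tags: r

I have a complicated problem and I don't know how to approach it. I have two data frames. The first is df1: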

structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"), 
City = structure(1:2, .Label = c("bb", "e"), class = "factor"), 
Type1 = c(NA, NA), Type2 = c(NA, NA)), .Names = c("State", 
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA, 
-2L))

The second is df2:

structure(list(state = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("Aaa", "Dd"), class = "factor"), city = structure(c(1L, 
2L, 3L, 4L, 4L, 5L, 6L), .Label = c("bb", "ccc", "ddd", "fff", 
"ggg", "hh"), class = "factor"), type = structure(c(1L, 2L, 2L, 
2L, 2L, 2L, 3L), .Label = c("Type 1", "Type 2", "Type 4"), class = "factor"), 
value = 1:7), .Names = c("state", "city", "type", "value"
), class = "data.frame", row.names = c(NA, -7L))

Data frame df1 looks like this:

State City Type1 Type2
Aaa   bb    NA    NA
Dd    e    NA    NA

and data frame df2 looks like this:

state city   type value
Aaa   bb Type 1     1
Aaa  ccc Type 2     2
Aaa  ddd Type 2     3
Dd  fff Type 2     4
Dd  fff Type 2     5
Dd  ggg Type 2     6
Dd   hh Type 4     7

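For the NAs in df1, I need to look up values from df2 according to the following rules:

1) If there is only a single instance where State = state and City = city for a given type in df2, insert that value into the corresponding Type1 or Type2 column of df1.

2) If there are multiple instances where State = state and City = city for a given type, I need to average all of those values and insert the mean into df1.

3) If there are no instances where State = state and City = city for a given type, I need to take the mean of all values of that type for the state and insert it into df1.

4) If there are no instances where State = state for a given type, the value should remain NA.

Just to clarify: basically I want the values in Type1 and Type2 of df1 averaged at as "resolved" a level as possible. In other words, I want to use the City-level mean where possible, and fall back to the State-level mean where it is not. Either way, I want the result returned for the original State and City combinations listed in df1 (even when only the State-level mean is available).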
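To make the rules concrete, they can be checked by hand against the sample data. This is only a sketch: df2 is rebuilt here with plain character columns, and the names r1, r3a, r3b and r4 are made up for illustration.

```r
# Sample data rebuilt from df2 above (character columns instead of factors)
df2 <- data.frame(
  state = c("Aaa", "Aaa", "Aaa", "Dd", "Dd", "Dd", "Dd"),
  city  = c("bb", "ccc", "ddd", "fff", "fff", "ggg", "hh"),
  type  = c("Type 1", "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 4"),
  value = 1:7,
  stringsAsFactors = FALSE
)

# Rule 1: (Aaa, bb) has exactly one "Type 1" row, so its value is used as is
r1 <- with(df2, mean(value[state == "Aaa" & city == "bb" & type == "Type 1"]))

# Rule 3: (Aaa, bb) has no "Type 2" row, so use the mean over all of Aaa
r3a <- with(df2, mean(value[state == "Aaa" & type == "Type 2"]))

# Rule 3 again: city "e" never appears in df2, so use the mean over all of Dd
r3b <- with(df2, mean(value[state == "Dd" & type == "Type 2"]))

# Rule 4: Dd has no "Type 1" rows at all, so that cell stays NA
r4 <- with(df2, any(state == "Dd" & type == "Type 1"))

c(r1, r3a, r3b)  # 1.0 2.5 5.0
r4               # FALSE
```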

I know this is complicated! My desired result is:

State City Type1 Type2
Aaa   bb     1   2.5
Dd    e     NA   5.0

which is a data frame like:

structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"), 
City = structure(1:2, .Label = c("bb", "e"), class = "factor"), 
Type1 = c(1L, NA), Type2 = c(2.5, 5)), .Names = c("State", 
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA, 
-2L))

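I don't even know where to begin with this. My first thought was that I need to reshape df2 using acast. For example, I could use

acast(df2, state+city+value~type)

which reshapes the data into something closer to df1, but then I lose some of the columns I need to keep (they get collapsed into the row names). I don't even know how to start on the challenge of searching for State and City matches and then averaging based on those results.

Can anyone point me in the right direction?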

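Edit (January 2015): I added a new comment under Troy's answer below, asking whether there is a simple way to add a column that identifies the level (city or state) at which the mean was calculated. I found a solution; there may be better ways, but it works for me. Hope this helps someone!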

With means and state_means defined as in Troy's answer:

getlevel<-function(state,city,type){
m<-means[means$state==state & means$city==city & means$type==type, "mean"]
sm<-state_means[state_means$state==state & state_means$type==type, "mean"]
ifelse(length(m)>0,"city","state")
}
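One way getlevel might be applied row by row is with mapply. The sketch below makes several assumptions: df1 and df2 are rebuilt with character columns, base R's aggregate() stands in for the plyr summaries from Troy's answer, and getlevel_na() is a hypothetical variant that returns NA when no state-level mean exists either (the original returns "state" in that case, and never uses sm).

```r
df1 <- data.frame(State = c("Aaa", "Dd"), City = c("bb", "e"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  state = c("Aaa", "Aaa", "Aaa", "Dd", "Dd", "Dd", "Dd"),
  city  = c("bb", "ccc", "ddd", "fff", "fff", "ggg", "hh"),
  type  = c("Type 1", "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 4"),
  value = 1:7,
  stringsAsFactors = FALSE
)

# Base R stand-ins for the two ddply() summaries
means <- aggregate(value ~ state + city + type, data = df2, FUN = mean)
names(means)[names(means) == "value"] <- "mean"
state_means <- aggregate(value ~ state + type, data = df2, FUN = mean)
names(state_means)[names(state_means) == "value"] <- "mean"

# Hypothetical variant of getlevel(): "city", "state", or NA if neither exists
getlevel_na <- function(state, city, type) {
  m  <- means[means$state == state & means$city == city & means$type == type, "mean"]
  sm <- state_means[state_means$state == state & state_means$type == type, "mean"]
  if (length(m) > 0) "city" else if (length(sm) > 0) "state" else NA_character_
}

# One call per row of df1; the type string is recycled across rows
df1$Type1_level <- mapply(getlevel_na, df1$State, df1$City, "Type 1", USE.NAMES = FALSE)
df1$Type2_level <- mapply(getlevel_na, df1$State, df1$City, "Type 2", USE.NAMES = FALSE)
df1
```

For the sample data, Type1_level is "city" for (Aaa, bb) and NA for (Dd, e), while Type2_level is "state" for both rows.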

2 Answers:

Answer 0: (score: 1)

Edit - sorry, I misunderstood the question. Here is corrected code that meets your criteria:

require(plyr)
means<-ddply(df2,.(state,city,type),summarize,mean=mean(value))
state_means<-ddply(df2,.(state,type),summarize,mean=mean(value))
getval<-function(state,city,type){
  m<-means[means$state==state & means$city==city & means$type==type, "mean"]
  sm<-state_means[state_means$state==state & state_means$type==type, "mean"]
  ifelse(length(m)>0,m,sm)
}
## this gives you the new df1
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"))
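The same city-then-state fallback can also be written without plyr and without a per-row function call. This is a rough sketch rather than a drop-in replacement: it assumes character columns, uses aggregate() and match() from base R, and fill_type() is a hypothetical helper name. The "|" separator assumes that character never occurs in state or city names.

```r
df1 <- data.frame(State = c("Aaa", "Dd"), City = c("bb", "e"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  state = c("Aaa", "Aaa", "Aaa", "Dd", "Dd", "Dd", "Dd"),
  city  = c("bb", "ccc", "ddd", "fff", "fff", "ggg", "hh"),
  type  = c("Type 1", "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 4"),
  value = 1:7,
  stringsAsFactors = FALSE
)

# Means at both levels of resolution
city_means  <- aggregate(value ~ state + city + type, data = df2, FUN = mean)
state_means <- aggregate(value ~ state + type, data = df2, FUN = mean)

# Hypothetical helper: one value per row of df1 for the given type
fill_type <- function(ty) {
  cm <- city_means[city_means$type == ty, ]
  sm <- state_means[state_means$type == ty, ]
  # match() yields NA where df1 has no counterpart at that level
  city_val  <- cm$value[match(paste(df1$State, df1$City, sep = "|"),
                              paste(cm$state, cm$city, sep = "|"))]
  state_val <- sm$value[match(df1$State, sm$state)]
  # Prefer the city-level mean, fall back to the state-level mean (or NA)
  ifelse(is.na(city_val), state_val, city_val)
}

df1$Type1 <- fill_type("Type 1")
df1$Type2 <- fill_type("Type 2")
df1
```

On the sample data this gives Type1 = 1, NA and Type2 = 2.5, 5, which matches the desired result in the question.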

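XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Previous answer (incomplete)

This is a bit tricky, because your structure() call for df2 doesn't work properly and your example dataset doesn't contain all the data shown in your expected result, but I think what you want is: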

require(plyr)
means<-ddply(df2,.(state,city,type),summarize,mean=mean(value))
getval<-function(state,city,type){means[means$state==state & means$city==city & means$type==type, "mean"]}
## this gives you the new df1
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"))

############################################################################X

## what's happening in detail:
require(plyr)                      # loads the plyr package
means<-ddply(df2,                  # base on df2 
             .(state,city,type),   # summarize by combination of city/state/type
             summarize,            # tells plyr to summarize rather than transform
             mean=mean(value))     # show one column at each summary level, called 'mean', the average val

getval<-function(state,city,type){     # create function called getval, takes 3 parameters
  means[means$state==state &           # first part of [X,]
          means$city==city &           # selects the row that matches all criteria
          means$type==type,            
        "mean"]}                       # and [,X] the column relating to the type

getval("Aaa","bb","Type 2")
# this gives you the new df1
ddply(df1,                             # base on df1
      .(State,City),                   # summarize by State & City
      transform,                       # tell plyr to transform the existing set rather than roll up
      Type1=getval(as.character(State),as.character(City),"Type 1"),   # call getval() for Type 1
      Type2=getval(as.character(State),as.character(City),"Type 2"))   # and for Type 2

which gives you the following (not your expected result, but what the data implies):

  State City Type1 Type2
1   Aaa   bb     1    NA
2    Dd    e    NA    NA

Answer 1: (score: 0)

First reshape the data in df2, then use data.table keys to merge the data appropriately:

library(data.table)
library(reshape2)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

First, reshape dt2 so that each type becomes its own column:

dt2.casted <- reshape2::dcast(dt2, state + city ~ type
                              , value.var="value"   # name the value column explicitly rather than letting dcast guess it
                              , fill=NA_real_
                              , fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)

Next, set keys so the tables can be merged:

setkey(dt2.casted, state, city)
setkey(dt1, State, City)

Finally, merge and take the mean, dropping the NAs:

dt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE), by=State, .SDcols=grep("Type", names(dt2.casted), value=TRUE)]

   State Type 1 Type 2 Type 4
1:   Aaa      1   2.50    NaN
2:    Dd    NaN   5.25      7

Alternative, based on the comments (no "City" aggregation):

dt2.casted <- reshape2::dcast(dt2, state ~ type
                              , value.var="value"   # name the value column explicitly rather than letting dcast guess it
                              , fill=NA_real_
                              , fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)

setkey(dt2.casted, state)
setkey(dt1, State)

dt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE)
                , by=list(State, City)
                , .SDcols=grep("Type"
                , names(dt2.casted), value=TRUE)
                ]

   State City Type 1 Type 2 Type 4
1:   Aaa   bb      1    2.5    NaN
2:    Dd    e    NaN    5.0      7
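
As a cross-check on the state-level figures in the last table, the same state-by-type means can be computed in base R. This is only a sketch: tapply() stands in for the dcast/data.table pipeline, df2 is rebuilt with character columns, and state_by_type is a made-up name.

```r
df2 <- data.frame(
  state = c("Aaa", "Aaa", "Aaa", "Dd", "Dd", "Dd", "Dd"),
  city  = c("bb", "ccc", "ddd", "fff", "fff", "ggg", "hh"),
  type  = c("Type 1", "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 4"),
  value = 1:7,
  stringsAsFactors = FALSE
)

# Mean of value for every state/type combination; empty combinations are NA
state_by_type <- tapply(df2$value, list(df2$state, df2$type), mean)
state_by_type
```

The Type 2 means are 2.5 for Aaa and 5 for Dd, as in that table. The 5.25 in the earlier city-level table differs because it averages the per-city means (4.5 and 6) rather than the raw values.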