Question

I have a complex problem and I'm not sure how to approach it. I have two data frames. The first is df1:

structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"), 
City = structure(1:2, .Label = c("bb", "e"), class = "factor"), 
Type1 = c(NA, NA), Type2 = c(NA, NA)), .Names = c("State", 
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA, 
-2L))

and the second is df2:

structure(list(state = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("Aaa", "Dd"), class = "factor"), city = structure(c(1L, 
2L, 3L, 4L, 4L, 5L, 6L), .Label = c("bb", "ccc", "ddd", "fff", 
"ggg", "hh"), class = "factor"), type = structure(c(1L, 2L, 2L, 
2L, 2L, 2L, 3L), .Label = c("Type 1", "Type 2", "Type 4"), class = "factor"), 
value = 1:7), .Names = c("state", "city", "type", "value"
), class = "data.frame", row.names = c(NA, -7L))

Data frame df1 looks like this:

State City Type1 Type2
Aaa   bb    NA    NA
Dd    e    NA    NA

and data frame df2 looks like this:

state city   type value
Aaa   bb Type 1     1
Aaa  ccc Type 2     2
Aaa  ddd Type 2     3
Dd  fff Type 2     4
Dd  fff Type 2     5
Dd  ggg Type 2     6
Dd   hh Type 4     7

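For the NA values in df1, I need to look up values from df2 according to the following rules:

1) If there is only a single instance in df2 where State = state and City = city for a given type, insert its value into the corresponding df1 column, Type1 or Type2

2) If there are multiple instances where State = state and City = city for a given type, I need to average all of their values and insert the result into df1

3) If there are no instances where State = state and City = city for a given type, I need to take the average of all values of that type within the state and insert it into df1

4) If there are no instances where State = state for a given type, the value should remain NA

Just to clarify: basically I want to average the values for Type1 and Type2 at the most "resolved" level possible. In other words, I want to use the City-level average where possible, and fall back to the State-level average where it is not. I still want the result returned for the original State and City listed in df1, though (even if only State-level averages are available).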

I know this is complicated! My desired result is

State City Type1 Type2
Aaa   bb     1   2.5
Dd    e     NA   5.0

which is a data frame like:

structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"), 
City = structure(1:2, .Label = c("bb", "e"), class = "factor"), 
Type1 = c(1L, NA), Type2 = c(2.5, 5)), .Names = c("State", 
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA, 
-2L))

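I don't even know where to start with this. My first thought was that I need to reshape df2 with acast. For example, I could use

acast(df2, state+city+value~type)

which reshapes the data into something closer to df1, but then I lose some columns that I need to keep (they get collapsed into the row names). I also don't know how to begin the challenge of searching on State and City and then averaging based on those results.

Can anyone point me in the right direction?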

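Edit (January 2015): I added a new comment under Troy's answer below, asking whether there is a simple way to add a column that identifies the level (city or state) at which the mean was calculated. I found a solution. There may be a better way, but it works for me. Hope this helps someone!

The function below relies on the means and state_means tables from Troy's answer:

getlevel <- function(state, city, type) {
  m  <- means[means$state == state & means$city == city & means$type == type, "mean"]
  sm <- state_means[state_means$state == state & state_means$type == type, "mean"]
  # "city" if a city-level mean exists, "state" if only a state-level mean exists, else NA
  if (length(m) > 0) "city" else if (length(sm) > 0) "state" else NA_character_
}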
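As a rough, self-contained illustration of the level lookup described in the edit above, the sketch below rebuilds df2 from the question and uses base R `aggregate()` as a stand-in for the plyr summaries. The helper name `mean_level` is hypothetical and not from the original post; it only mirrors the idea of `getlevel`.

```r
# Sample data from the question (df2), rebuilt with plain character columns.
df2 <- data.frame(
  state = c("Aaa", "Aaa", "Aaa", "Dd", "Dd", "Dd", "Dd"),
  city  = c("bb", "ccc", "ddd", "fff", "fff", "ggg", "hh"),
  type  = c("Type 1", "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 4"),
  value = 1:7,
  stringsAsFactors = FALSE
)

# Base-R stand-ins for the plyr tables `means` and `state_means`.
means       <- aggregate(value ~ state + city + type, data = df2, FUN = mean)
state_means <- aggregate(value ~ state + type, data = df2, FUN = mean)

# Hypothetical helper: which level would supply the mean for this lookup?
mean_level <- function(state, city, type) {
  has_city  <- any(means$state == state & means$city == city & means$type == type)
  has_state <- any(state_means$state == state & state_means$type == type)
  if (has_city) "city" else if (has_state) "state" else NA_character_
}

mean_level("Aaa", "bb", "Type 1")  # a city-level row exists in df2
mean_level("Aaa", "bb", "Type 2")  # only state-level rows exist
mean_level("Dd",  "e",  "Type 1")  # no rows at either level
```

Using `if`/`else` rather than `ifelse()` keeps the three-way outcome explicit for a single lookup.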

Answer 1

Edit: sorry, I misunderstood the question. Here is corrected code that meets your conditions:

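library(plyr)
# city-level means: one row per state/city/type combination
means <- ddply(df2, .(state, city, type), summarize, mean = mean(value))
# state-level means: one row per state/type combination
state_means <- ddply(df2, .(state, type), summarize, mean = mean(value))
getval <- function(state, city, type) {
  m  <- means[means$state == state & means$city == city & means$type == type, "mean"]
  sm <- state_means[state_means$state == state & state_means$type == type, "mean"]
  # prefer the city-level mean, fall back to the state-level mean, else NA
  if (length(m) > 0) m else if (length(sm) > 0) sm else NA_real_
}
## this gives you the new df1
ddply(df1, .(State, City), transform,
      Type1 = getval(as.character(State), as.character(City), "Type 1"),
      Type2 = getval(as.character(State), as.character(City), "Type 2"))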
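For readers without plyr, the same city-then-state fallback can be sketched in base R. This is only an illustrative variant, not the answer's method: `fill_type`, `city_means`, and `result` are names made up here, the data frames are rebuilt from the question with character columns, and the `"|"` key separator assumes no state or city name contains that character.

```r
# Sample data from the question, rebuilt with plain character columns.
df1 <- data.frame(State = c("Aaa", "Dd"), City = c("bb", "e"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  state = c("Aaa", "Aaa", "Aaa", "Dd", "Dd", "Dd", "Dd"),
  city  = c("bb", "ccc", "ddd", "fff", "fff", "ggg", "hh"),
  type  = c("Type 1", "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 4"),
  value = 1:7,
  stringsAsFactors = FALSE
)

# Means at the two levels of detail.
city_means  <- aggregate(value ~ state + city + type, data = df2, FUN = mean)
state_means <- aggregate(value ~ state + type, data = df2, FUN = mean)

# Hypothetical helper: one filled column for a given type label.
fill_type <- function(type_label) {
  cm <- city_means[city_means$type == type_label, ]
  sm <- state_means[state_means$type == type_label, ]
  # match() yields NA where no row is found, so failed lookups become NA
  city_val  <- cm$value[match(paste(df1$State, df1$City, sep = "|"),
                              paste(cm$state, cm$city, sep = "|"))]
  state_val <- sm$value[match(df1$State, sm$state)]
  # use the city-level mean where present, otherwise the state-level mean
  ifelse(is.na(city_val), state_val, city_val)
}

result <- df1
result$Type1 <- fill_type("Type 1")
result$Type2 <- fill_type("Type 2")
result
```

Because `match()` is vectorised over all rows of df1, this version avoids calling a lookup function once per row.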

Previous answer (incomplete)

This is a bit difficult because your structure call for df2 does not work properly, and your example dataset does not produce all of the data in your expected result, but I think what you want is:

require(plyr)
means<-ddply(df2,.(state,city,type),summarize,mean=mean(value))
getval<-function(state,city,type){means[means$state==state & means$city==city & means$type==type, "mean"]}
## this gives you the new df1
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"))

############################################################################X

## what's happening in detail:
require(plyr)                      # calls the plyr library
means<-ddply(df2,                  # base on df2 
             .(state,city,type),   # summarize by combination of city/state/type
             summarize,            # tells plyr to summarize rather than transform
             mean=mean(value))     # show one column at each summary level, called 'mean', the average val

getval<-function(state,city,type){     # create function called getval, takes 3 parameters
  means[means$state==state &           # first part of [X,]
          means$city==city &           # selects the row that matches all criteria
          means$type==type,            
        "mean"]}                       # and [,X] selects the "mean" column

getval("Aaa","bb","Type 2")
# this gives you the new df1
ddply(df1,                             # base on df1
      .(State,City),                   # summarize by State & City
      transform,                       # tell plyr to transform the existing set rather than roll up
      Type1=getval(as.character(State),as.character(City),"Type 1"),   # call getval() for Type 1
      Type2=getval(as.character(State),as.character(City),"Type 2"))   # and for Type 2

This gives you the following (not your expected result, but what the data imply):

  State City Type1 Type2
1   Aaa   bb     1    NA
2    Dd    e    NA    NA

Answer 2

First reshape the data in df2, then use data.table's keys to merge the data appropriately:

library(data.table)
library(reshape2)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

First, fix dt2's type column by reshaping:

# one row per state/city, one column per type, cells hold the mean of `value`
dt2.casted <- reshape2::dcast(dt2, state + city ~ type
                              , value.var="value"
                              , fill=NA_real_
                              , fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)

Next, set the `keys` so that we can merge:

setkey(dt2.casted, state, city)
setkey(dt1, State, City)

Finally, merge and take the mean, dropping the `NA`s:

dt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE), by=State, .SDcols=grep("Type", names(dt2.casted), value=TRUE)]

   State Type 1 Type 2 Type 4
1:   Aaa      1   2.50    NaN
2:    Dd    NaN   5.25      7

Alternative, based on the comments (no "City" aggregation):

dt2.casted <- reshape2::dcast(dt2, state ~ type
                              , fill=NA_real_
                              , fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)

setkey(dt2.casted, state)
setkey(dt1, State)

dt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE)
                , by=list(State, City)
                , .SDcols=grep("Type", names(dt2.casted), value=TRUE)
                ]

   State City Type 1 Type 2 Type 4
1:   Aaa   bb      1    2.5    NaN
2:    Dd    e    NaN    5.0      7
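The two outputs above differ for state Dd, Type 2 (5.25 versus 5.0). A small base R sketch of the arithmetic, using only the three Dd / Type 2 rows from df2, suggests why: the first approach averages the per-city means, while the alternative averages the raw values. The variable names here are illustrative only.

```r
# The Dd / "Type 2" rows of df2 from the question.
dd_type2 <- data.frame(city  = c("fff", "fff", "ggg"),
                       value = c(4, 5, 6))

# Per-city means first (fff: (4 + 5) / 2, ggg: 6), then their average.
city_means         <- tapply(dd_type2$value, dd_type2$city, mean)
mean_of_city_means <- mean(city_means)

# Average of the raw values, ignoring city: (4 + 5 + 6) / 3.
mean_of_raw_values <- mean(dd_type2$value)

mean_of_city_means
mean_of_raw_values
```

Only the second figure matches the 5 in the question's expected result, which is presumably why the alternative drops the city-level aggregation.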

Match data between multiple columns of 2 data frames to return a "matched" value or an average, based on one or both columns matching