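Use one data.frame to update another

Time: 2011-11-01 19:04:08

Tags: r indexing dataframe

Given 2 data frames that are identical in terms of column names and data types, where some of the columns uniquely identify the rows, is there an efficient function or method for one data.frame to "update" the other?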

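For example, in the following, the rows of original and replacement are identified by 'Name' and 'Id'. goal is the result of finding all rows of replacement in original (by the unique ids) and replacing their Value1 and Value2.
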
original = data.frame( Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,NA), Value2 = c(NA,9.2) )
replacement = data.frame( Name = c("john") , Id = 2 , Value1 = 2.2 , value2 = 5.9)
goal = data.frame( Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,2.2), Value2 = c(NA,5.9) )

The solution should work for original and replacement of arbitrary length (although replacement should never have more rows than original). In practice, I use 2 id columns.

7 answers:

Answer 0 (score: 11)

I would use data.table objects. This code seems to work for your example:

library(data.table)

# set keys
original.dt <- data.table(original, key=c("Name", "Id"))        
replacement.dt <- data.table(replacement, key=c("Name", "Id"))

goal2 <- original.dt
# subset and reassign
# goal2[replacement.dt[, list(Name, Id)]] <- replacement.dt
goal2[replacement.dt] <- replacement.dt  # cleaner and faster, see Matthew's comment

goal2 <- as.data.frame(goal2)

identical(goal, goal2) # FALSE, why? See Joris's comment
all.equal(goal, goal2) # TRUE
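As a complement to the code above, here is a minimal sketch of the same update written as a data.table update join with `:=`. It assumes the column in replacement is spelled Value2 (capital V), unlike in the question, and the name goal3 is only illustrative.

```r
library(data.table)

original <- data.frame(Name = c("joe", "john"), Id = c(1, 2),
                       Value1 = c(1.2, NA), Value2 = c(NA, 9.2))
# Assumption: Value2 is spelled with a capital V here, unlike in the question
replacement <- data.frame(Name = "john", Id = 2, Value1 = 2.2, Value2 = 5.9)

# as.data.table() copies, so the original data frames stay untouched
original.dt <- as.data.table(original)
replacement.dt <- as.data.table(replacement)

# Update join: for every row of original.dt matched by a row of replacement.dt
# on (Name, Id), overwrite Value1 and Value2 by reference; the i. prefix
# refers to columns of replacement.dt
original.dt[replacement.dt, on = c("Name", "Id"),
            `:=`(Value1 = i.Value1, Value2 = i.Value2)]

goal3 <- as.data.frame(original.dt)
```

Rows of replacement.dt without a match in original.dt are simply not used, so no rows are added.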

Answer 1 (score: 6)

Just set the unique IDs as the row names. Then it is simple indexing:

rownames(original) = original$Id
rownames(replacement) = replacement$Id

original[rownames(replacement), ] = replacement
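The row-name approach relies on every ID in replacement already existing in original. As far as I know, assigning to a row name that does not exist appends a new row instead of skipping it. The sketch below guards against that by restricting the update to the IDs both data frames share; the extra "jane" row and the variable common are hypothetical and only there for illustration.

```r
original <- data.frame(Name = c("joe", "john"), Id = c(1, 2),
                       Value1 = c(1.2, NA), Value2 = c(NA, 9.2))
# Hypothetical replacement with an extra row (Id 3) that is absent from original
replacement <- data.frame(Name = c("john", "jane"), Id = c(2, 3),
                          Value1 = c(2.2, 7.0), Value2 = c(5.9, 8.0))

rownames(original) <- original$Id
rownames(replacement) <- replacement$Id

# Keep only the IDs that exist in both data frames
common <- intersect(rownames(replacement), rownames(original))
original[common, ] <- replacement[common, ]
```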

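Answer 2 (score: 6)

With base R, you can use the replace.df() function below, which is based on the source code of merge.data.frame(). Unlike some of the other solutions, this one allows multiple columns to be used for identification. I use it regularly at work. Feel free to copy and use it.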

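This function handles the case where rows in y are not found in x. Note that the function does not check whether the key combinations are unique: match() only returns the first occurrence of a combination, so only that first occurrence is replaced.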
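To illustrate the first-occurrence behavior of match(), and one way to check for duplicate key combinations before updating, here is a small sketch. The vectors keys.x and keys.y, the data frame x, and the helper has.duplicates() are hypothetical names, not part of the answer's function.

```r
keys.x <- c("a", "b", "a")   # hypothetical keys of x; "a" appears twice
keys.y <- c("a")             # hypothetical key of y

match(keys.y, keys.x)        # returns 1: the second "a" at position 3 is ignored

# Simple uniqueness check on the identifying columns
has.duplicates <- function(df, by) anyDuplicated(df[, by, drop = FALSE]) > 0

x <- data.frame(Name = c("joe", "john", "joe"), Id = c(1, 2, 1))
has.duplicates(x, c("Name", "Id"))   # TRUE: ("joe", 1) occurs twice
```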

The function is used as follows:

> replace.df(original, replacement,by=c('Name','Id'))
  Name Id Value1 Value2
1  joe  1    1.2     NA
2 john  2    2.2    9.2

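Note that this effectively detects the typo in the original code: replacement contains a variable named "value2" (lowercase v) rather than Value2 (capital V). After correcting this, the result becomes: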

> replace.df(original, replacement,by=c('Name','Id'))
  Name Id Value1 Value2
1  joe  1    1.2     NA
2 john  2    2.2    5.9

You can also use the function to change the values in only some of the columns:

> replace.df(original, replacement,by=c('Name','Id'),cols='Value2')
  Name Id Value1 Value2
1  joe  1    1.2     NA
2 john  2     NA    5.9

The function:

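replace.df <- function(x, y, by, cols = NULL) {
    nx <- nrow(x)
    ny <- nrow(y)

    # build one key string per row from the identifying columns
    bx <- x[, by, drop = FALSE]
    by.y <- y[, by, drop = FALSE]
    bz <- do.call("paste", c(rbind(bx, by.y), sep = "\r"))

    kx <- bz[seq_len(nx)]
    ky <- bz[nx + seq_len(ny)]

    # for each row of y, the first matching row of x (NA if there is none)
    idx <- match(ky, kx)
    idy <- which(!is.na(idx))   # rows of y that were found in x
    idx <- idx[idy]             # the rows of x they correspond to, in the same order

    if (is.null(cols)) {
      # default: all shared columns except the identifying ones
      cols <- intersect(names(x), names(y))
      cols <- cols[!cols %in% by]
    }

    x[idx, cols] <- y[idy, cols]
    x
  }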

Answer 3 (score: 2)

Here is an approach using the digest package.

library(digest)
# generate keys for each row using the md5 checksum based on first two columns
check1 <- apply(original[,1:2], 1, digest)
check2 <- apply(replacement[,1:2], 1, digest)

# set goal to original and replace rows in replacement
goal <- original
goal[check1 %in% check2,] <- replacement

Answer 4 (score: 1)

# limit replacement to elements that have a correspondence in original 
existing = replacement[is.element(replacement$Id, original$Id),]
# replace original at positions where IDs from existing match   
original[match(existing$Id,original$Id),]=existing
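Since the question mentions two id columns in practice, the same match() idea can be sketched with a composite key. This is only an illustration under assumptions: the names ids, key.orig, key.repl, keep and pos are hypothetical, replacement uses the corrected Value2 spelling, and "\r" is assumed not to occur in the data.

```r
original <- data.frame(Name = c("joe", "john"), Id = c(1, 2),
                       Value1 = c(1.2, NA), Value2 = c(NA, 9.2))
replacement <- data.frame(Name = "john", Id = 2, Value1 = 2.2, Value2 = 5.9)

ids <- c("Name", "Id")
# One key string per row, built from all id columns
key.orig <- do.call(paste, c(original[ids], sep = "\r"))
key.repl <- do.call(paste, c(replacement[ids], sep = "\r"))

# Limit replacement to rows whose key exists in original
keep <- key.repl %in% key.orig
existing <- replacement[keep, ]

# Positions in original that correspond to the kept replacement rows
pos <- match(key.repl[keep], key.orig)
original[pos, ] <- existing
```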

Answer 5 (score: 1)

require(plyr)
indexes_to_replace <- rownames(match_df(original,replacement,on='Id'))
indexes_from_replace<-rownames(match_df(replacement,original,on='Id'))
original[indexes_to_replace,] <- replacement[indexes_from_replace,]

The on argument of match_df can also take a vector.

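Answer 6 (score: 1)

I made a function that uses the indexing approach (see John Colby's answer above). I hope it is useful for all such needs of updating one data frame with values from another data frame.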

update.df.with.df <- function(original, replacement, key, value) 
{
    ## PURPOSE: Update a data frame with the values in another data frame
    ## ----------------------------------------------------------------------
    ## ARGUMENT:
    ##   original: a data frame to update,
    ##   replacement: a data frame that has the updated values,
    ##   key: a character vector of variable names to form the unique key
    ##   value: a character vector of variable names to form the values that need to be updated
    ## ----------------------------------------------------------------------
    ## RETURN: The updated data frame from the old data frame "original". 
    ## ----------------------------------------------------------------------
    ## AUTHOR: Feiming Chen,  Date:  2 Dec 2015, 15:08

    n1 <- rownames(original) <- apply(original[, key, drop=FALSE], 1, paste, collapse=".")
    n2 <- rownames(replacement) <- apply(replacement[, key, drop=FALSE], 1, paste, collapse=".")

    ## keys present in both data frames, as a character vector
    ## (works whether or not strings are stored as factors)
    n4 <- intersect(n1, n2)

    original[n4, value] <- replacement[n4, value] # update values on the common keys
    original
}
if (FALSE) {                            # Unit Test 
    original <- data.frame(x=c(1, 2, 3), y=c(10, 20, 30))
    replacement <- data.frame(x=2, y=25)
    update.df.with.df(original, replacement, key="x", value="y") # data.frame(x=c(1, 2, 3), y=c(10, 25, 30))

    original <- data.frame(x=c(1, 2, 3), w=c("a", "b", "c"), y=c(10, 20, 30))
    replacement <- data.frame(x=2, w="b", y=25)
    update.df.with.df(original, replacement, key=c("x", "w"), value="y") # data.frame(x=c(1, 2, 3), w=c("a", "b", "c"), y=c(10, 25, 30))

    original = data.frame(Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,NA), Value2 = c(NA,9.2))
    replacement = data.frame(Name = c("john") , Id = 2 , Value1 = 2.2 , Value2 = 5.9)
    update.df.with.df(original, replacement, key="Id", value=c("Value1", "Value2"))
    ## goal = data.frame( Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,2.2), Value2 = c(NA,5.9) )
}