Question

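I have two data frames, d1 and d2:

> d1
   x  y z
1 10 10 7
2 10 12 6
3 11 10 8
4 11 12 2
5 12 10 1
6 12 12 5

> d2
   x  y   z
1 10 10 100
2 11 10 200
3 12 12 400

I would like to get something like:

> d3
   x  y   z
1 10 10 100
2 10 12   6
3 11 10 200
4 11 12   2
5 12 10   1
6 12 12 400

I am really sorry for this trivial question, but I could not work out the answer.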

Answer 1

As per your description, I understand that you want to replace the z values in d1 with the z values from d2 wherever x and y match.

Using base R:

# left join: keep every row of d1, attach z from d2 as z.y where (x, y) matches
d3 <- merge(d1, d2, by = c("x","y"), all.x = TRUE)
# where d2 had no match, fall back to the original z from d1
d3[is.na(d3$z.y),"z.y"] <- d3[is.na(d3$z.y),"z.x"]
# drop the z.x column and rename z.y to z
d3 <- d3[,-3]
names(d3)[3] <- "z"

which gives:

> d3
   x  y   z
1 10 10 100
2 10 12   6
3 11 10 200
4 11 12   2
5 12 10   1
6 12 12 400

Using the data.table package:

library(data.table)

setDT(d1) # convert the data.frame to a data.table
setDT(d2) # idem

# join the two data.tables and replace the values by reference
d1[d2, on = .(x, y), z := i.z]

or in one go:

setDT(d1)[setDT(d2), on = .(x, y), z := i.z]

which gives:

> d1
    x  y   z
1: 10 10 100
2: 10 12   6
3: 11 10 200
4: 11 12   2
5: 12 10   1
6: 12 12 400

Using the dplyr package:

library(dplyr)

d3 <- left_join(d1, d2, by = c("x","y")) %>%
  mutate(z.y = ifelse(is.na(z.y), z.x, z.y)) %>%
  select(-z.x) %>%
  rename(z = z.y)
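With a recent dplyr (1.0.0 or later, which is an assumption about your setup), the same update can probably be written more directly with rows_update(). The sketch below rebuilds d1 and d2 from the values used in this thread so that it runs on its own; the name d3 is only illustrative.

```r
library(dplyr)

d1 <- data.frame(x = c(10L, 10L, 11L, 11L, 12L, 12L),
                 y = c(10L, 12L, 10L, 12L, 10L, 12L),
                 z = c(7L, 6L, 8L, 2L, 1L, 5L))
d2 <- data.frame(x = 10:12, y = c(10L, 10L, 12L), z = c(100L, 200L, 400L))

# rows_update() overwrites the non-key columns of d1 with those of d2
# for rows whose key (x, y) matches; the keys in d2 must be unique
# and, by default, must already be present in d1
d3 <- rows_update(d1, d2, by = c("x", "y"))
d3
```

Unlike the join-based pipeline, this keeps the rows of d1 in their original order and needs no temporary z.x / z.y columns.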


Answer 2

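Building on the excellent answer by @Jaap with respect to data.table:

In data.table you can join using keys, and keyed operations are generally among the fastest that data.table offers. The key columns can even have different names in the two tables; see the modified example below, where the second key column of d1 is called q instead of y.

The data are read from two space-separated files, d1.csv and d2.csv, and the code is: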

library(data.table)

d1 <- fread("d1.csv", sep=" ")
d2 <- fread("d2.csv", sep=" ")

# here is data.table keys magic
# note different column names
setkey(d1, x, q)
setkey(d2, x, y)

q <- d2[d1][is.na(z), z := i.z][, i.z := NULL]

print(q)

Result:

    x  y   z
1: 10 10 100
2: 10 12   6
3: 11 10 200
4: 11 12   2
5: 12 10   1
6: 12 12 400
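Because the CSV files themselves are not shown, here is a minimal self-contained sketch of the same keyed join. It assumes d1 has the columns x, q, z and d2 has the columns x, y, z, with the values used elsewhere in this thread; the name res is hypothetical.

```r
library(data.table)

# assumed contents of d1.csv and d2.csv (note the column q in d1)
d1 <- data.table(x = c(10L, 10L, 11L, 11L, 12L, 12L),
                 q = c(10L, 12L, 10L, 12L, 10L, 12L),
                 z = c(7L, 6L, 8L, 2L, 1L, 5L))
d2 <- data.table(x = 10:12, y = c(10L, 10L, 12L), z = c(100L, 200L, 400L))

# keys are matched by position, so (x, q) in d1 pairs with (x, y) in d2
setkey(d1, x, q)
setkey(d2, x, y)

# d2[d1] returns one row per row of d1: z comes from d2 (NA if unmatched),
# i.z is the original z from d1
res <- d2[d1]
res[is.na(z), z := i.z]  # fall back to d1's value where d2 had no match
res[, i.z := NULL]       # drop the helper column
print(res)
```

The key columns of the result take their names from d2, which is why the output shows x and y rather than x and q.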

Answer 3

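It sounds like you want to ensure that there is only one z value for every combination of x and y. The main question is how to choose which z value to associate with it. From the description, I am guessing that you either always want the second data frame to overwrite the first, or you want to take the maximum value.

Starting with the original data: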

df1 <- structure(list(x = c(10L, 10L, 11L, 11L, 12L, 12L), y = c(10L, 12L, 10L, 12L, 10L, 12L), z = c(7L, 6L, 8L, 2L, 1L, 5L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(x = 10:12, y = c(10L, 10L, 12L), z = c(100L, 200L,400L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA,-3L))

If it is the maximum you want, then you probably just want to stack the two frames with rbind and extract the maximum z for each x and y:

merged.df <- aggregate(z ~ x + y, data = rbind(df1, df2), max)

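If instead you want the second data frame to overwrite the first, then aggregate using the last value for each x and y. Because df2 is stacked below df1, the last value is the one from df2 whenever a match exists: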

merged.df <- aggregate(z ~ x+ y, data=rbind(df1, df2), function(d) tail(d, n=1))
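With the data above the two strategies give the same result, because every z in df2 is larger than the value it replaces. A small sketch with made-up values shows where they differ; the frames a and b and the result names are hypothetical and are not the question's data.

```r
# hypothetical data: the second frame holds a SMALLER z for the key (10, 10)
a <- data.frame(x = c(10L, 10L), y = c(10L, 12L), z = c(7L, 6L))
b <- data.frame(x = 10L, y = 10L, z = 3L)
both <- rbind(a, b)

# strategy 1: keep the largest z per (x, y)
by_max <- aggregate(z ~ x + y, data = both, FUN = max)

# strategy 2: keep the last z per (x, y), i.e. let b overwrite a
by_last <- aggregate(z ~ x + y, data = both, FUN = function(v) tail(v, n = 1))

by_max   # for (10, 10) the value 7 from a survives
by_last  # for (10, 10) the value 3 from b wins
```

So the max variant is only equivalent to an overwrite when the replacement values happen to be the larger ones.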

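If you have many columns besides z, then I can only assume that you want the latter behavior. For that you are better off using a library such as data.table or dplyr. In dplyr (1.0.0 or later) it would look like this:

library(dplyr)
# stack the frames, then keep the last value of every column within each (x, y) group
merged.df <- rbind(df1, df2) %>%
  group_by(x, y) %>%
  summarise(across(everything(), last), .groups = "drop")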

With data.table, it would look like this:

require(data.table)
merged.df <- setDT(rbind(df1, df2))[, lapply(.SD, last), .(x,y)]

Answer 4

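This update join can also be done in base R, using match to find the indices used for subsetting the table and interaction to build a single key vector from the key columns of each table.

This way, neither the order nor the size of d1 changes. If a key occurs more than once in d1, match returns only its first occurrence, so only that row is updated from d2.

For example, looking up where the keys of d2 match those of d1: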

d1 <- read.table(header=TRUE, text="x   y  z
10  10 7
10  12 6
11  10 8
11  12 2
12  10 1
12  12 5")
d2 <- read.table(header=TRUE, text="x  y  z
10 10 100
11 10 200
12 12 400")

key <- c("x", "y") #define which columns are used as matching key
idx <- match(interaction(d2[key]), interaction(d1[key])) #find where it matches
d1$z[idx] <- d2$z #make the update

d1 #show result
#   x  y   z
#1 10 10 100
#2 10 12   6
#3 11 10 200
#4 11 12   2
#5 12 10   1
#6 12 12 400

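Alternatively, match in the other direction and update only the rows of d1 that have a match in d2:

idx <- match(interaction(d1[key]), interaction(d2[key])) #for each row of d1, its position in d2 (NA if none)
idxn <- which(!is.na(idx)) #find the rows of d1 that do have a match
d1$z[idxn] <- d2$z[idx[idxn]] #update only those rows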
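The two variants behave differently when a key is repeated in d1. The sketch below uses made-up data with a duplicated key to illustrate this; the frames and the names first_only and all_rows are hypothetical.

```r
# hypothetical data: the key (10, 10) appears twice in d1
d1 <- data.frame(x = c(10L, 10L, 11L), y = c(10L, 10L, 12L), z = c(7L, 8L, 2L))
d2 <- data.frame(x = 10L, y = 10L, z = 100L)
key <- c("x", "y")

# variant 1: look up each d2 key in d1; match() returns the first hit only
first_only <- d1
idx <- match(interaction(d2[key]), interaction(first_only[key]))
first_only$z[idx] <- d2$z

# variant 2: look up each d1 key in d2; every matching row of d1 is updated
all_rows <- d1
idx <- match(interaction(all_rows[key]), interaction(d2[key]))
idxn <- which(!is.na(idx))
all_rows$z[idxn] <- d2$z[idx[idxn]]

first_only$z  # only the first (10, 10) row is updated
all_rows$z    # both (10, 10) rows are updated
```

If d1 may contain repeated keys and all of them should be updated, the second variant is probably the safer choice.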

Merging data frames of different sizes

4 answers: