How to perform a sequential merge in R based on multiple columns from the same two datasets

Time: 2017-07-11 00:30:22

Tags: r

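I need to perform a sequential merge in R. By that I mean: suppose I have two datasets, orders and deliveries.

I want to match these orders and deliveries together, but I first want to merge on the address column; then, for the rows that did not match, merge on zip code; then, for the rows that still did not match, merge on latitude and longitude; then, for the rows that still did not match, merge on some other attribute, and so on.

I can easily do a merge based on one attribute:

    merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
                    by.y = c("date", "delivery_address"), sort = FALSE)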

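But now I want to match the rows that did not match in merge1, say on zip code, which has a different name in each dataset ("zipcode" in one dataset and "postcode" in the other).

I tried doing a left join on orders, then finding the rows in merge1 that returned NA for some column from the deliveries dataset, and then doing another merge with that subset, but I could not get this to work.

    merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
                    by.y = c("date", "delivery_address"), all.x = TRUE, sort = FALSE)

    merge2 <- merge(merge1[is.na(merge1$delivery_address),], deliveries,
                    by.x = c("order_date", "zipcode"),
                    by.y = c("date", "postcode"), all.x = TRUE, sort = FALSE)

I know this is completely wrong, since it only returns NA values and duplicates the columns, but that was my line of thinking.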
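A likely cause of the NA values and duplicated columns: after `merge()`, the key columns keep the names from the first data frame, so the left-join result has no `delivery_address` column to test, while the non-key delivery columns (filled with NA for unmatched rows) are carried into the second merge. The sketch below illustrates this on toy data with hypothetical values; it assumes a non-key delivery column such as `length` never contains genuine NAs, so that NA there means "no match".

```r
# Toy data with hypothetical values; column names follow the question's sample
orders <- data.frame(order = 1:3,
                     address = c(10, 20, 30),
                     zipcode = c(1, 2, 3))
deliveries <- data.frame(length = c(5, 6),
                         delivery_address = c(10, 99),
                         postcode = c(7, 2))

# Left join on address: the key keeps the name from `orders`, so the result
# has no delivery_address column, only length and postcode from deliveries
merge1 <- merge(orders, deliveries, by.x = "address",
                by.y = "delivery_address", all.x = TRUE, sort = FALSE)
names(merge1)

# Unmatched orders have NA in a non-key delivery column such as `length`
# (assumption: `length` has no genuine NAs). Keep only the original order
# columns so the second merge adds each delivery column exactly once.
unmatched <- merge1[is.na(merge1$length), names(orders)]
merge2 <- merge(unmatched, deliveries, by.x = "zipcode", by.y = "postcode",
                sort = FALSE)
names(merge2)
```

Subsetting to `names(orders)` before the second merge is what prevents the `.x`/`.y` suffixed duplicates.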

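Basically, I just want a way to do a sequential merge in R between two datasets, first on one column, then on another, and so on. I don't want a left join; I want an inner join that returns only the matching rows. However, I could do left joins and then, after all the merges, select only the rows without NA. So my final result should be all the orders matched with deliveries, but only those orders that actually matched.

Edit:

People asked for some sample data, so here is some: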

orders <- data.frame( order = c(1,2,3,4,5,6,7,8,9,10),
                      address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
                      zipcode = c(001, 002, 001, 999, 999, 006, 007, 007, 999, 010))

deliveries <- data.frame(length = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12), 
                         delivery_address = c(1111, 1112, 0111, 1113, 1114, 0000, 1618, 0001, 0002, 0405, 1121),
                         postcode = c(001, 912, 001, 910, 913, 006, 080, 007, 074, 088, 010))


merge1 <- merge(orders, deliveries, by.x = "address", by.y = "delivery_address", sort = FALSE)

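So merge1 correctly gives me the orders matched with deliveries that have the same address. Now, how do I add to the merge1 dataset the rows that did not match the deliveries dataset, so that I can match them by zip code? There are still some orders and deliveries that can be matched by zip code.

2 Answers:

Answer 0: (score: 2)

This works for your sample data: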

# functions used here use dplyr to process data
library("dplyr")

# using forward pipe syntax for readability of this example
# though this isn't necessary for functions to work
library("magrittr")

# merge by exact matches between address and delivery_address
# add column of delivery_address for binding later so dataframes match
merge1 <- orders %>%
  inner_join(y = deliveries,
             by = c("address" = "delivery_address")) %>%
  mutate(delivery_address = address)

# extract unmatched rows from orders, then merge exact matches by
# zipcode to postcode.
# add postcode column for binding
merge2 <- orders %>%
  anti_join(y = deliveries,
            by = c("address" = "delivery_address")) %>%
  inner_join(y = deliveries,
             by = c("zipcode" = "postcode")) %>%
  mutate(postcode = zipcode)

# bind two sets of results together.
results <- bind_rows(merge1, merge2)
results

I highly recommend the RStudio cheat sheets on data transformation for this sort of work.

Answer 1: (score: 0)
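If more fallback keys are needed (latitude/longitude and so on), the two steps above can be folded into a loop. This is only a sketch under assumptions that go beyond the answer: `order` is taken to be a unique identifier for each order, and `key_pairs`, `remaining`, `matched` and `matched_on` are hypothetical names introduced here. Recent dplyr versions may also warn about a many-to-many relationship on the zipcode step, because several orders share a postcode.

```r
library("dplyr")

# Sample data from the question (leading zeros dropped, as R parses them away)
orders <- data.frame(order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
                     zipcode = c(1, 2, 1, 999, 999, 6, 7, 7, 999, 10))
deliveries <- data.frame(length = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12),
                         delivery_address = c(1111, 1112, 111, 1113, 1114, 0, 1618, 1, 2, 405, 1121),
                         postcode = c(1, 912, 1, 910, 913, 6, 80, 7, 74, 88, 10))

# Key pairs in priority order: names are columns in orders,
# values are the corresponding columns in deliveries
key_pairs <- c(address = "delivery_address", zipcode = "postcode")

remaining <- orders   # orders not matched by any earlier key
matched <- list()     # one data frame of matches per key

for (x_key in names(key_pairs)) {
  by_spec <- key_pairs[x_key]                    # e.g. c(address = "delivery_address")
  step <- inner_join(remaining, deliveries, by = by_spec)
  step[[unname(by_spec)]] <- step[[x_key]]       # restore the deliveries key column
  step$matched_on <- x_key                       # record which key produced the match
  matched[[x_key]] <- step
  # drop the orders matched in this step before trying the next key
  remaining <- anti_join(remaining, step, by = "order")
}

results <- bind_rows(matched)
results
```

After the loop, `remaining` holds the orders that matched on none of the keys, which makes it easy to see what a further fallback key would still have to cover.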

Consider merging on all rows for each key pair and row binding the results, then use unique() to remove duplicates:

    merge1 <- unique(rbind(transform(merge(orders, deliveries, by.x = "address",
                                           by.y = "delivery_address", sort = FALSE),
                                     delivery_address = address),
                           transform(merge(orders, deliveries, by.x = "zipcode",
                                           by.y = "postcode", sort = FALSE),
                                     postcode = zipcode)))

    #    address order zipcode length postcode delivery_address
    # 1     1111     1       1      4        1             1111
    # 2     1112     2       2      5      912             1112
    # 3     1113     4     999     11      910             1113
    # 4     1114     5     999     13      913             1114
    # 5     1618     6       6     93       80             1618
    # 6     1314     3       1      9        1              111
    # 7     1314     3       1      4        1             1111
    # 8     1111     1       1      9        1              111
    # 10    1618     6       6     15        6                0
    # 11    1917     7       7     17        7                1
    # 12    1118     8       7     17        7                1
    # 13    2000    10      10     12       10             1121

For a generalized solution across multiple columns, use Map() wrapped in do.call() on a user-defined function, seqmerge, where you extend xvars and yvars to the pairings of merge columns. Be sure both are the same length.

    seqmerge <- function(xvar, yvar) {
      df <- merge(orders, deliveries, by.x = xvar, by.y = yvar, sort = FALSE)
      df[[yvar]] <- df[[xvar]]
      return(df)
    }

    xvars <- c("address", "zipcode")              # ADD MORE AS NEEDED
    yvars <- c("delivery_address", "postcode")    # ADD MORE AS NEEDED

    merge2 <- unique(do.call(rbind, Map(seqmerge, xvars, yvars, USE.NAMES=FALSE)))

    all.equal(merge1, merge2)
    # [1] TRUE
    identical(merge1, merge2)
    # [1] TRUE
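Note that merging every order on every key also pairs orders that already matched on an earlier key (order 1, for example, is additionally paired by zip code with a delivery at a different address). If strictly sequential behaviour is wanted, one possible variant is to tag each merge with its priority and keep, for every order, only the rows from the highest-priority key that matched. The sketch below assumes `order` uniquely identifies an order; `seqmerge_ranked`, `priority`, `all_matches` and `sequential` are hypothetical names, not part of the answer above.

```r
# Sample data from the question (leading zeros dropped, as R parses them away)
orders <- data.frame(order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
                     zipcode = c(1, 2, 1, 999, 999, 6, 7, 7, 999, 10))
deliveries <- data.frame(length = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12),
                         delivery_address = c(1111, 1112, 111, 1113, 1114, 0, 1618, 1, 2, 405, 1121),
                         postcode = c(1, 912, 1, 910, 913, 6, 80, 7, 74, 88, 10))

# Same merge as seqmerge, plus a column recording the rank of the key pair
seqmerge_ranked <- function(xvar, yvar, priority) {
  df <- merge(orders, deliveries, by.x = xvar, by.y = yvar, sort = FALSE)
  df[[yvar]] <- df[[xvar]]   # restore the deliveries key column for rbind
  df$priority <- priority    # 1 = first key tried, 2 = second, ...
  df
}

xvars <- c("address", "zipcode")
yvars <- c("delivery_address", "postcode")

# All matches on all keys; rbind aligns the data frame columns by name
all_matches <- do.call(rbind, Map(seqmerge_ranked, xvars, yvars,
                                  seq_along(xvars), USE.NAMES = FALSE))

# For each order, find the best (lowest) priority that produced a match,
# then keep only the rows that came from that key
best <- ave(all_matches$priority, all_matches$order, FUN = min)
sequential <- all_matches[all_matches$priority == best, ]
sequential
```

Because the `priority` column makes rows from different keys distinct, `unique()` is no longer needed here; the filter on `best` takes over the job of discarding the lower-priority matches.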
