Finding one-to-one, one-to-many, and many-to-one relationships between columns

Time: 2015-09-28 16:54:14

Tags: r dplyr

Consider the following data frame:

  first_name last_name
1         Al     Smith
2         Al     Jones
3       Jeff  Thompson
4      Scott  Thompson
5      Terry    Dactil
6       Pete       Zah

data <- data.frame(first_name=c("Al","Al","Jeff","Scott","Terry","Pete"),
                   last_name=c("Smith","Jones","Thompson","Thompson","Dactil","Zah"))

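In this data frame, first_name relates to last_name in three ways:

  • One-to-one (a first_name and a last_name that are each paired only with the other)
  • One-to-many (one first_name points to multiple last_name values)
  • Many-to-one (multiple first_name values point to one last_name)

I would like to quickly identify each of the three cases and output each to its own data frame. The resulting data frames would be: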

One-to-one

  first_name last_name
1      Terry    Dactil
2       Pete       Zah

One-to-many

  first_name last_name
1         Al     Smith
2         Al     Jones

Many-to-one

  first_name last_name
1       Jeff  Thompson
2      Scott  Thompson

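I would like to do this with the dplyr package.

4 answers:

Answer 0 (score: 6)

In general, you can use the duplicated function to check whether a value is repeated (as @RichardScriven noted in a comment on your question). By default, however, this function does not flag the first occurrence of a repeated element as a duplicate: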

duplicated(c(1, 1, 1, 2))
# [1] FALSE  TRUE  TRUE FALSE

Since you want to catch those cases as well, you generally need to run duplicated twice on each vector, once forward and once backward:

duplicated(c(1, 1, 1, 2)) | duplicated(c(1, 1, 1, 2), fromLast=TRUE)
# [1]  TRUE  TRUE  TRUE FALSE

That is a lot of typing, so I will define a helper function that checks whether an element appears more than once:

d <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)

Now the logic you want is a set of simple one-liners:

# One to one
data[!d(data$first_name) & !d(data$last_name),]
#   first_name last_name
# 5      Terry    Dactil
# 6       Pete       Zah

# One to many
data[d(data$first_name) & !d(data$last_name),]
#   first_name last_name
# 1         Al     Smith
# 2         Al     Jones

# Many to one
data[!d(data$first_name) & d(data$last_name),]
#   first_name last_name
# 3       Jeff  Thompson
# 4      Scott  Thompson

Note that you could also define the d function using table instead of duplicated:

d <- function(x) table(x)[as.character(x)] > 1  # look up counts by name, not by position

Although this alternative definition is slightly more concise, I also find it less readable.
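
One caveat with a table-based lookup is that indexing a table with a numeric vector is positional rather than by name, so the values are converted with as.character first. The following is a minimal sketch, not part of the original answer; the names d_dup and d_table are made up here so the two variants can be compared side by side.

```r
# Two hypothetical names for the two variants of the helper `d`.
d_dup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# Look up each value's count by name; strip names so the result is a
# plain logical vector like the one d_dup() returns.
d_table <- function(x) unname(as.vector(table(x)[as.character(x)] > 1))

first_name <- c("Al", "Al", "Jeff", "Scott", "Terry", "Pete")
identical(d_dup(first_name), d_table(first_name))

# Numeric input, where positional indexing would not match the values.
d_table(c(5, 5, 7))
```

Both helpers return a logical vector of the same length as the input, so either can be dropped into the subsetting expressions above.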

Answer 1 (score: 1)

Here is a pure dplyr approach that uses the same logic as josliber, adding a new count column for each variable:

library(dplyr)

# Add the number of rows sharing each first_name and each last_name
data <- data %>%
  add_count(first_name, name="first_name_n") %>%
  add_count(last_name, name="last_name_n")

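# one-to-one: both names occur once
data %>% filter(first_name_n == 1 & last_name_n == 1)

# one-to-many: the first_name repeats, the last_name does not
data %>% filter(first_name_n > 1 & last_name_n == 1)

# many-to-one: the last_name repeats, the first_name does not
data %>% filter(first_name_n == 1 & last_name_n > 1)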
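
As a variation on the count columns, the three filters can be folded into a single label column. This is only a sketch, assuming the data frame from the question; the names labelled and relationship are illustrative and not part of the original answer, and the "many-to-many" fallback is an addition for rows where both names repeat.

```r
library(dplyr)

data <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah"),
  stringsAsFactors = FALSE
)

labelled <- data %>%
  add_count(first_name, name = "first_name_n") %>%
  add_count(last_name, name = "last_name_n") %>%
  mutate(relationship = case_when(
    first_name_n == 1 & last_name_n == 1 ~ "one-to-one",
    first_name_n > 1  & last_name_n == 1 ~ "one-to-many",
    first_name_n == 1 & last_name_n > 1  ~ "many-to-one",
    TRUE                                 ~ "many-to-many"  # both names repeat
  ))

# One data frame per relationship type, as a named list
by_type <- split(labelled, labelled$relationship)
```

Here by_type[["one-to-one"]] and its siblings hold the per-case data frames the question asks for.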

Answer 2 (score: 0)

Using the approach suggested by @josliber, I built a function that takes two vectors and returns their relationship type:

library(dplyr)
library(tidyr)  # drop_na()

relationship_type <- function(x1, x2, na.rm = FALSE) {

  df <- tibble(x1 = x1, x2 = x2)

  if (na.rm) {
    df <- df %>%
      drop_na()
  }

  res <- c()

  # Keep each (x1, x2) pair once so repeated rows are not counted as duplicates
  counts <- df %>%
    distinct(x1, x2)

  if (any(is.na(counts$x2))) {
    res <- c(res, "one to zero")
  }

  if (any(is.na(counts$x1))) {
    res <- c(res, "zero to one")
  }

  if (anyDuplicated(counts$x1) == 0 && anyDuplicated(counts$x2) == 0) {
    res <- c(res, "one to one")
  }

  if (anyDuplicated(counts$x1) > 0 && anyDuplicated(counts$x2) == 0) {
    res <- c(res, "one to many")
  }

  if (anyDuplicated(counts$x1) == 0 && anyDuplicated(counts$x2) > 0) {
    res <- c(res, "many to one")
  }

  if (anyDuplicated(counts$x1) > 0 && anyDuplicated(counts$x2) > 0) {
    res <- c(res, "many to many")
  }

  res
}

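The "one to zero" and "zero to one" results tell you whether some entries in one vector map to no entry (NA) in the other vector. You can write another wrapper for this function that takes a data frame and a pair of column names and passes the result back.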
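
Such a wrapper might look like the sketch below. The names relationship_in_df and classify_pairs are hypothetical and not from the original answer. classify_pairs is a simplified base R stand-in (no NA handling) included only so the example runs on its own; in practice you would pass relationship_type as the classify argument.

```r
# Simplified stand-in classifier for this sketch only (no NA handling).
classify_pairs <- function(x1, x2) {
  pairs <- unique(data.frame(x1 = x1, x2 = x2, stringsAsFactors = FALSE))
  dup1 <- anyDuplicated(pairs$x1) > 0
  dup2 <- anyDuplicated(pairs$x2) > 0
  if (dup1 && dup2) {
    "many to many"
  } else if (dup1) {
    "one to many"
  } else if (dup2) {
    "many to one"
  } else {
    "one to one"
  }
}

# Hypothetical wrapper: takes a data frame and two column names, and
# forwards the two columns (plus any extra arguments) to `classify`.
relationship_in_df <- function(df, col1, col2, classify = classify_pairs, ...) {
  stopifnot(is.data.frame(df), col1 %in% names(df), col2 %in% names(df))
  classify(df[[col1]], df[[col2]], ...)
}

data <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah"),
  stringsAsFactors = FALSE
)

whole  <- relationship_in_df(data, "first_name", "last_name")
subset <- relationship_in_df(data[5:6, ], "first_name", "last_name")
```

With the function from this answer in scope, the call would be relationship_in_df(data, "first_name", "last_name", classify = relationship_type, na.rm = TRUE). Note that the wrapper classifies the two columns as a whole, so the full example data comes back as "many to many" even though individual rows fall into the three simpler cases.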

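Answer 3 (score: 0)

If the data contains duplicate rows, any solution based on the duplicated function will not work, because a repeated (first_name, last_name) pair makes both of its values look non-unique even when each maps to a single partner.

For example: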

data1 = data.frame(first_name=c("Al","Al","Jeff","Scott","Terry","Pete", "Jeff","Scott","Terry","Pete"),
                   last_name=c("Smith","Jones","Thompson","Thompson","Dactil","Zah", "Smith","Jones", "Dactil","Zah"))
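
To make the failure concrete, here is a small base R sketch that reuses the d helper from answer 0 on data1. The variable names naive and deduped are illustrative only, and calling unique first is one possible workaround rather than something proposed in the answers.

```r
d <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

data1 <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete",
                 "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah",
                 "Smith", "Jones", "Dactil", "Zah"),
  stringsAsFactors = FALSE
)

# Naive one-to-one filter: every first_name occurs twice, so nothing is kept.
naive <- data1[!d(data1$first_name) & !d(data1$last_name), ]
nrow(naive)

# Dropping repeated rows first restores the expected pairs.
deduped <- unique(data1)
deduped[!d(deduped$first_name) & !d(deduped$last_name), ]
```

The deduplicated filter keeps Terry Dactil and Pete Zah once each, whereas the naive filter returns no rows at all.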

Below is a solution using data.table that works for the case above as well as for the OP's original data:

library(data.table)
setDT(data1)

# One to one
temp1 = data1[ , uniqueN(last_name), by = 'first_name'][V1 == 1]
temp2 = data1[ , uniqueN(first_name), by = 'last_name'][V1 == 1]
data1[first_name %in% temp1$first_name & last_name %in% temp2$last_name]

#    first_name last_name
# 1:      Terry    Dactil
# 2:       Pete       Zah
# 3:      Terry    Dactil
# 4:       Pete       Zah


# One to many
temp3 = data1[ , uniqueN(last_name), by = 'first_name'][V1 > 1]
data1[first_name %in% temp3$first_name][order(first_name)]

#    first_name last_name
# 1:         Al     Smith
# 2:         Al     Jones
# 3:       Jeff  Thompson
# 4:       Jeff     Smith
# 5:      Scott  Thompson
# 6:      Scott     Jones


# Many to one
temp4 = data1[ , uniqueN(first_name), by = 'last_name'][V1 > 1]
data1[last_name %in% temp4$last_name][order(last_name)]

#    first_name last_name
# 1:         Al     Jones
# 2:      Scott     Jones
# 3:         Al     Smith
# 4:       Jeff     Smith
# 5:       Jeff  Thompson
# 6:      Scott  Thompson
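
Since the question asks for dplyr, the same distinct-count idea can be sketched there with n_distinct. This is an assumption-laden translation of the data.table logic above, not code from the original answer; the column names n_last and n_first are made up for illustration.

```r
library(dplyr)

data1 <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete",
                 "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah",
                 "Smith", "Jones", "Dactil", "Zah"),
  stringsAsFactors = FALSE
)

counted <- data1 %>%
  group_by(first_name) %>%
  mutate(n_last = n_distinct(last_name)) %>%    # distinct last names per first name
  group_by(last_name) %>%
  mutate(n_first = n_distinct(first_name)) %>%  # distinct first names per last name
  ungroup()

one_to_one  <- counted %>% filter(n_last == 1, n_first == 1)
one_to_many <- counted %>% filter(n_last > 1)
many_to_one <- counted %>% filter(n_first > 1)
```

As in the data.table version, repeated rows are kept in the output, and the one-to-many and many-to-one results overlap when a name repeats on both sides.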