Return a row's id by searching for a value across multiple columns

Date: 2017-08-17 07:19:52

Tags: r dplyr plyr

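Suppose I have two data frames:

A = a data frame of unique phone numbers with an additional factor column. Assume nrow(A) = 20.

B = a data frame in which each row represents a unique household, with four columns of listed phone numbers and a fifth column holding the unique household ID. The same number may be repeated across several of B's columns. Assume nrow(B) = 100.

I would like to return a table that has each unique phone number from A together with its household ID, after checking whether the A phone number appears in any of the four columns.

For example: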

a <- data.frame(phone=c("12345","12346","12456"),
                factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
                ph2 = c("","","12346","","12348"), 
                ph3 = c("","","","12456","67890"), 
                hhid = seq(1121,1125))

How can I return C, which looks like this:

c <- data.frame(phone = c("12345","12346","12456"),
                factor = c("OK","BAD","BAD"), 
                hhid = c("1121","1123","1124"))

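I'm sure this can be done very elegantly or with a minimal amount of code. I thought about using a for loop or a merge, but I suspect that is the wrong track. I'm open to using any package.

3 Answers:

Answer 0: (score: 3)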

library(dplyr)
library(tidyr)

a <- data.frame(phone=c("12345","12346","12456"),
                factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
                ph2 = c("","","12346","","12348"), 
                ph3 = c("","","","12456","67890"), 
                hhid = seq(1121,1125))

# reshape data and keep unique combinations
b2 = b %>% 
  gather(ph, phone, -hhid) %>% 
  select(-ph) %>% 
  distinct()

# join data frames
left_join(a, b2, by = "phone")

#   phone factor hhid
# 1 12345     OK 1121
# 2 12346    BAD 1123
# 3 12456    BAD 1124
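In current tidyr releases, gather() is superseded by pivot_longer(). Below is a minimal sketch of the same reshape-then-join idea, assuming tidyr 1.0.0 or later. The names b_long and res, the filter on empty strings, and stringsAsFactors = FALSE are illustrative additions, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question; stringsAsFactors = FALSE is an
# assumption so the phone columns are plain character vectors on any R version.
a <- data.frame(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

# Stack every phone column into one "phone" column, drop blanks,
# and keep one row per (hhid, phone) pair.
b_long <- b %>%
  pivot_longer(cols = -hhid, names_to = "ph", values_to = "phone") %>%
  filter(phone != "") %>%
  select(hhid, phone) %>%
  distinct()

# Attach the household id to each phone number in a.
res <- left_join(a, b_long, by = "phone")
res
```

If a phone number were listed under more than one household, left_join() would return one row per matching household, so the result could have more rows than a.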

Answer 1: (score: 2)

Here is an option using data.table:

library(data.table)
setDT(a)[unique(setDT(b)[, .(phone = unlist(.SD)), hhid][phone != ""]),
          hhid := hhid, on = .(phone)]
a
#   phone factor hhid
#1: 12345     OK 1121
#2: 12346    BAD 1123
#3: 12456    BAD 1124
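The one-liner above reshapes b and updates a by reference in a single expression. For readers who prefer separate steps, here is a sketch of the same idea using melt(), assuming a and b are built as data.tables from the start. The names b_long and res are hypothetical, and copy() is used only so the original a stays untouched.

```r
library(data.table)

# Example data from the question, created directly as data.tables (assumption).
a <- data.table(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"))
b <- data.table(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125))

# Step 1: long format, one row per (hhid, phone), blanks removed.
b_long <- melt(b, id.vars = "hhid", value.name = "phone")[phone != ""]
b_long <- unique(b_long[, .(hhid, phone)])

# Step 2: update join; i.hhid refers explicitly to the hhid column of b_long.
res <- copy(a)[b_long, hhid := i.hhid, on = "phone"]
res[]
```

Writing i.hhid rather than hhid makes it explicit that the value comes from the table being joined in.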

Answer 2: (score: 0)

Here is a base R solution, assuming you read the data in as character or set the option options(stringsAsFactors = FALSE):

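# reshape b to long format and keep the unique (hhid, phone) pairs
tmp <- unique(reshape(b, 
    direction="long",
    varying = 1:3,
    v.names="phone",
    timevar = "variable")[,c(1, 3)])
# drop the empty phone entries (assign the result so the filter takes effect)
tmp <- tmp[tmp$phone!="",]
merge(tmp, a, by="phone")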
#  phone hhid factor
#1 12345 1121     OK
#2 12346 1123    BAD
#3 12456 1124    BAD
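If each phone number belongs to at most one household, the lookup can also be sketched in base R without reshaping, using match(). This is an alternative to the reshape() approach above, not a restatement of it. The names phone_cols, phones, and hhids are hypothetical.

```r
# Example data from the question; stringsAsFactors = FALSE is an assumption
# so that unlist() returns character values rather than factor codes.
a <- data.frame(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

phone_cols <- c("ph1", "ph2", "ph3")

# unlist() concatenates the phone columns one after another, so repeating
# hhid once per column keeps the two vectors aligned.
phones <- unlist(b[phone_cols], use.names = FALSE)
hhids  <- rep(b$hhid, times = length(phone_cols))

# match() returns the position of the first hit (NA if the number is absent).
a$hhid <- hhids[match(a$phone, phones)]
a
```

Because match() keeps only the first hit, a number shared by several households would silently get just one hhid. In that case the merge-based approach is the safer choice.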