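Suppose I have two data frames:
A = a data frame of unique phone numbers with an additional factor column. Assume nrow(A) = 20.
B = a data frame whose rows represent unique households, with four columns of listed phone numbers and a fifth column holding a unique household ID. The same number may be repeated across several of B's phone columns. Assume nrow(B) = 100.
I want to check whether each phone number in A appears in any of the four phone columns of B, and return a table that attaches the matching household ID to each unique phone number from A.
For example: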
a <- data.frame(phone=c("12345","12346","12456"),
factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""),
ph2 = c("","","12346","","12348"),
ph3 = c("","","","12456","67890"),
hhid = seq(1121,1125))
How can I return C, which looks like this:
c <- data.frame(phone = c("12345","12346","12456"),
factor = c("OK","BAD","BAD"),
hhid = c("1121","1123","1124"))
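I'm sure this can be done very elegantly or with very little code. I thought about using a for loop or merge, but I suspect that is the wrong track. I'm open to using any package.
Answer 0 (score: 3)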
library(dplyr)
library(tidyr)
a <- data.frame(phone=c("12345","12346","12456"),
factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""),
ph2 = c("","","12346","","12348"),
ph3 = c("","","","12456","67890"),
hhid = seq(1121,1125))
# reshape data and keep unique combinations
b2 = b %>%
gather(ph, phone, -hhid) %>%
select(-ph) %>%
distinct()
# join data frames
left_join(a, b2, by = "phone")
# phone factor hhid
# 1 12345 OK 1121
# 2 12346 BAD 1123
# 3 12456 BAD 1124
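As a variation on the answer above, the same reshape-then-join idea can be sketched with pivot_longer(), which the tidyr documentation recommends over gather() in current releases. This is only a sketch under some assumptions: tidyr 1.0.0 or later is installed, the phone columns are read as character (hence stringsAsFactors = FALSE), and the names b_long and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question, kept as character columns
a <- data.frame(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

# Stack the phone columns into one, drop blanks, keep unique hhid/phone pairs
b_long <- b %>%
  pivot_longer(-hhid, names_to = "ph", values_to = "phone") %>%
  filter(phone != "") %>%
  distinct(hhid, phone)

# Attach the household ID to each phone number in a
result <- left_join(a, b_long, by = "phone")
result
```

Dropping the blank entries before the join is optional here, since no phone in a is blank, but it keeps the lookup table small. If one number belonged to several households, left_join would return one row per household.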
Answer 1 (score: 2)
Here is a data.table approach:
library(data.table)
setDT(a)[unique(setDT(b)[, .(phone = unlist(.SD)), hhid][phone != ""]),
hhid := hhid, on = .(phone)]
a
# phone factor hhid
#1: 12345 OK 1121
#2: 12346 BAD 1123
#3: 12456 BAD 1124
Answer 2 (score: 0)
Here is a base R solution, provided you read the data in as character or set the option: options(stringsAsFactors = FALSE)
tmp <- unique(reshape(b,
direction="long",
varying = 1:3,
v.names="phone",
timevar = "variable")[,c(1, 3)])
# drop blank phone entries (the result must be assigned back to take effect)
tmp <- tmp[tmp$phone != "", ]
merge(tmp, a, by="phone")
# phone hhid factor
#1 12345 1121 OK
#2 12346 1123 BAD
#3 12456 1124 BAD
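For comparison, a base R lookup without reshaping might look like the sketch below. It is not taken from any of the answers: it assumes the phone columns are character, that every number in a is non-blank, and that returning only the first matching household is acceptable when a number is listed under more than one. The names phone_cols, phones, and hhids are hypothetical.

```r
# Example data from the question, kept as character columns
a <- data.frame(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

phone_cols <- c("ph1", "ph2", "ph3")

# Flatten the phone columns column by column into one vector
phones <- unlist(b[phone_cols], use.names = FALSE)

# Repeat the household IDs once per phone column so they line up with phones
hhids <- rep(b$hhid, times = length(phone_cols))

# match() gives the position of the first occurrence of each phone number
a$hhid <- hhids[match(a$phone, phones)]
a
```

Phone numbers in a that do not appear anywhere in b get NA for hhid, which mirrors the behaviour of a left join.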