I have a very large dataset, so here is a toy example.
This is the data frame df:
df <- structure(list(Target = structure(c(1L, 4L, 5L, 2L, 3L), .Label = c("Jim",
"Kurt", "Lester", "Tara", "Taylor"), class = "factor"), Gender = structure(c(2L,
1L, 1L, 2L, 2L), .Label = c("F", "M"), class = "factor"), Partner1 = structure(c(1L,
4L, 4L, 2L, 3L), .Label = c("Andrew", "Jim", "Mickey", "Taylor"
), class = "factor"), Partner2 = structure(c(2L, 3L, 1L, 4L,
3L), .Label = c("Andrew", "Jim", "Kurt", "Mickey"), class = "factor"),
Partner4 = structure(c(4L, 3L, 2L, 3L, 1L), .Label = c("Andrew",
"Jim", "Lester", "Tara"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
I want to de-identify every member in the "Target" and "Partner" columns using the key provided here:
key <- structure(list(name = structure(c(2L, 5L, 1L, 6L, 4L, 3L), .Label = c("Andrew",
"Jim", "Kurt", "Lester", "Mickey", "Taylor"), class = "factor"),
id = structure(c(2L, 5L, 1L, 6L, 4L, 3L), .Label = c("A3",
"J9", "K5", "L4", "M4", "T7"), class = "factor")), class = "data.frame", row.names = c(NA,
-6L))
I know the names can be replaced one column at a time like this:
df[["Partner1"]] <- key[ match(df[['Partner1']], key[['name']] ) , 'id']
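But I would like to vectorize this, so that every name in the key is recoded to its corresponding ID across all of the columns at once.
The real data will have hundreds of columns, about 30 of which are the ones I want to de-identify.
Any suggestions?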
Answer 0 (score: 2)
A possible solution in base R:
# column names to replace
cols <- c('Target','Partner1','Partner2','Partner4')
# convert df subset to a matrix of characters
mx <- as.matrix(df[,cols])
# get the replacements values using match
repl <- as.character(key$id)[match(mx,as.character(key$name))]
# substitute NA's in replacements with the original values
repl[is.na(repl)] <- mx[is.na(repl)]
# create a copy of df
df2 <- df
# replace the values of df2 with the replacements
df2[,cols] <- repl
Result (Tara is not in the key, so her name is left unchanged):
> df2
  Target Gender Partner1 Partner2 Partner4
1     J9      M       A3       J9     Tara
2   Tara      F       T7       K5       L4
3     T7      F       T7       A3       J9
4     K5      M       J9       M4       L4
5     L4      M       M4       K5       A3
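Since the real data has hundreds of columns, typing out `cols` by hand may be impractical. The following is a minimal sketch, not part of the original answer, that assumes the columns to de-identify can be picked out by a name pattern and wraps the steps above in a helper. The function name `deidentify` and the pattern are hypothetical, and the toy data is re-created with character columns for brevity.

```r
# Toy data re-created from the question (character columns instead of factors)
df <- data.frame(
  Target   = c("Jim", "Tara", "Taylor", "Kurt", "Lester"),
  Gender   = c("M", "F", "F", "M", "M"),
  Partner1 = c("Andrew", "Taylor", "Taylor", "Jim", "Mickey"),
  Partner2 = c("Jim", "Kurt", "Andrew", "Mickey", "Kurt"),
  Partner4 = c("Tara", "Lester", "Jim", "Lester", "Andrew"),
  stringsAsFactors = FALSE
)
key <- data.frame(
  name = c("Jim", "Mickey", "Andrew", "Taylor", "Lester", "Kurt"),
  id   = c("J9", "M4", "A3", "T7", "L4", "K5"),
  stringsAsFactors = FALSE
)

# Assumption: the columns of interest share a recognisable name prefix
cols <- grep("^(Target|Partner)", names(df), value = TRUE)

# Hypothetical helper: same matrix + match() steps as in the answer
deidentify <- function(data, cols, key) {
  mx <- as.matrix(data[, cols, drop = FALSE])
  repl <- as.character(key$id)[match(mx, as.character(key$name))]
  # names without an entry in the key keep their original value
  not_found <- is.na(repl)
  repl[not_found] <- mx[not_found]
  data[, cols] <- repl
  data
}

df2 <- deidentify(df, cols, key)
```

If the columns do not follow a naming pattern, `cols` can just as well be read from a separate list of column names; the helper itself does not care how the vector was built.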
Answer 1 (score: 2)
Another base R solution:
# Create lookup vector
# (df2 here is the key data frame, called `key` in the question)
lu_vect <- setNames(as.character(df2[["id"]]), df2[["name"]])
lu_vect
#    Jim Mickey Andrew Taylor Lester   Kurt
#   "J9"   "M4"   "A3"   "T7"   "L4"   "K5"
# Make a list of columns we want to *update*
cols_to_anonymise <- c("Target", "Partner1", "Partner2", "Partner4")
# Anonymise column by column, if name is not in key, replace by NA
df[cols_to_anonymise] <- lapply(
  df[cols_to_anonymise],
  function(x) lu_vect[as.character(x)]
)
# Print out results
df
#   Target Gender Partner1 Partner2 Partner4
# 1     J9      M       A3       J9     <NA>
# 2   <NA>      F       T7       K5       L4
# 3     T7      F       T7       A3       J9
# 4     K5      M       J9       M4       L4
# 5     L4      M       M4       K5       A3
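With the lookup vector, names that are missing from the key become NA. A minimal sketch of two possible variations follows, assuming the key data frame is called `key` as in the question: listing the names that have no ID yet, and keeping the original value instead of NA. The helper name `recode_keep` is hypothetical, and the toy data is re-created with character columns.

```r
# Toy data re-created from the question (character columns instead of factors)
df <- data.frame(
  Target   = c("Jim", "Tara", "Taylor", "Kurt", "Lester"),
  Gender   = c("M", "F", "F", "M", "M"),
  Partner1 = c("Andrew", "Taylor", "Taylor", "Jim", "Mickey"),
  Partner2 = c("Jim", "Kurt", "Andrew", "Mickey", "Kurt"),
  Partner4 = c("Tara", "Lester", "Jim", "Lester", "Andrew"),
  stringsAsFactors = FALSE
)
key <- data.frame(
  name = c("Jim", "Mickey", "Andrew", "Taylor", "Lester", "Kurt"),
  id   = c("J9", "M4", "A3", "T7", "L4", "K5"),
  stringsAsFactors = FALSE
)

lu_vect <- setNames(as.character(key$id), as.character(key$name))
cols_to_anonymise <- c("Target", "Partner1", "Partner2", "Partner4")

# Names that appear in the data but have no entry in the key
all_names <- unique(unlist(lapply(df[cols_to_anonymise], as.character)))
missing_names <- setdiff(all_names, names(lu_vect))

# Hypothetical helper: fall back to the original value when there is no ID
recode_keep <- function(x) {
  x <- as.character(x)
  ids <- unname(lu_vect[x])
  ifelse(is.na(ids), x, ids)
}

df[cols_to_anonymise] <- lapply(df[cols_to_anonymise], recode_keep)
```

Keeping the original value leaves that person identifiable, so checking `missing_names` first and extending the key is probably the safer route for real data.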
Answer 2 (score: 1)
One possibility using the tidyverse (here df2 is the key data frame, called key in the question):
library(tidyverse)

df %>%
  rowid_to_column() %>%
  gather(var, val, -rowid) %>%
  left_join(df2, by = c("val" = "name")) %>%
  mutate(val = ifelse(var == "Gender", val,
                      ifelse(!is.na(id), paste0(id), NA_character_))) %>%
  select(-id) %>%
  spread(var, val) %>%
  select(-rowid)
  Gender Partner1 Partner2 Partner4 Target
1      M       A3       J9     <NA>     J9
2      F       T7       K5       L4   <NA>
3      F       T7       A3       J9     T7
4      M       J9       M4       L4     K5
5      M       M4       K5       A3     L4
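First, it reshapes the data from wide to long format. Second, it joins the reshaped df with df2. If df2 contains an ID for a name in df, the name is replaced with that ID; otherwise it is replaced with NA. Finally, it reshapes the data back to the original wide format.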
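As an alternative to the reshape, join, reshape sequence described above, the same NA-for-unknown recoding can be sketched with `across()`, which is available in dplyr 1.0.0 and later. This is an assumption-laden sketch rather than the answerer's method: the key data frame is called `key` as in the question, the `lookup` vector is introduced here for illustration, and the toy data is re-created with character columns.

```r
library(dplyr)

# Toy data re-created from the question (character columns instead of factors)
df <- data.frame(
  Target   = c("Jim", "Tara", "Taylor", "Kurt", "Lester"),
  Gender   = c("M", "F", "F", "M", "M"),
  Partner1 = c("Andrew", "Taylor", "Taylor", "Jim", "Mickey"),
  Partner2 = c("Jim", "Kurt", "Andrew", "Mickey", "Kurt"),
  Partner4 = c("Tara", "Lester", "Jim", "Lester", "Andrew"),
  stringsAsFactors = FALSE
)
key <- data.frame(
  name = c("Jim", "Mickey", "Andrew", "Taylor", "Lester", "Kurt"),
  id   = c("J9", "M4", "A3", "T7", "L4", "K5"),
  stringsAsFactors = FALSE
)

cols <- c("Target", "Partner1", "Partner2", "Partner4")

# Named vector: names are the real names, values are the IDs
lookup <- setNames(as.character(key$id), as.character(key$name))

# Recode only the selected columns; names absent from the key become NA
out <- df %>%
  mutate(across(all_of(cols), ~ unname(lookup[as.character(.x)])))
```

Because nothing is reshaped, the columns of `out` stay in their original order and Gender needs no special handling.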
Or a base R solution:
# recode every column except Gender (column 2), then add Gender back
data.frame(apply(df[, -2], 2, function(x) as.character(df2$id)[match(x, as.character(df2$name))]),
           Gender = df[, 2])
  Target Partner1 Partner2 Partner4 Gender
1     J9       A3       J9     <NA>      M
2   <NA>       T7       K5       L4      F
3     T7       T7       A3       J9      F
4     K5       J9       M4       L4      M
5     L4       M4       K5       A3      M