Say I have two tables, name and age, that look like this:
> name
    key   name
1 a,b,c   jack
2     d daniel
3     e    foo
4   f,g    bar
> age
  key age
1   b  13
2   d  21
3   e  24
4   k  34
5   f 100
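I am trying to join these two tables on the key column, which exists in both. The challenge is that the key column in the name table is not normalized: a single cell can hold several comma-separated keys. What is the best way to combine the two tables so that every row of the name table is present and kept as-is in the joined table (like the "res" table below)?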
> res
    key   name age
1 a,b,c   jack  13
2     d daniel  21
3     e    foo  24
4   f,g    bar 100
Here is the dput output needed to reproduce the tables:
> dput(name)
structure(list(key = structure(1:4, .Label = c("a,b,c", "d",
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L,
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor")), .Names = c("key",
"name"), class = "data.frame", row.names = c(NA, -4L))
> dput(age)
structure(list(key = structure(c(1L, 2L, 3L, 5L, 4L), .Label = c("b",
"d", "e", "f", "k"), class = "factor"), age = c(13L, 21L, 24L,
34L, 100L)), .Names = c("key", "age"), class = "data.frame", row.names = c(NA,
-5L))
> dput(res)
structure(list(key = structure(1:4, .Label = c("a,b,c", "d",
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L,
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor"),
age = c(13L, 21L, 24L, 100L)), .Names = c("key", "name",
"age"), class = "data.frame", row.names = c(NA, -4L))
Answer 0 (score: 3)
Perhaps you can turn the "key" column of the "name" data.frame into regular-expression patterns and use sapply, as follows:
sapply(gsub(",", "|", name$key), function(x) grep(x, age$key))
# a|b|c     d     e   f|g
#     1     2     3     5
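This returns the row numbers of the "age" data.frame where a match was found, one per row of "name" and in the same order as the rows of "name".
You can then use these row numbers to pull the "age" values out of the "age" data.frame with basic [row, col] extraction, as follows, and assign the result to a new age column in the "name" data.frame (for example, name$age):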
age[sapply(gsub(",", "|", name$key), function(x) grep(x, age$key)), "age"]
# [1] 13 21 24 100
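One caveat with the pattern approach: an unanchored pattern such as a|b|c also matches any longer key that merely contains one of those letters. The following is a minimal sketch, not the answerer's code, that anchors each pattern so only whole keys match and returns NA when a row has no match. It rebuilds the question's tables with character columns; the names patterns, idx and res are chosen here for illustration, and it assumes the keys contain no regex metacharacters.

```r
# Rebuild the question's tables with plain character keys
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Turn "a,b,c" into "^(a|b|c)$" so that only whole keys match
patterns <- paste0("^(", gsub(",", "|", name$key), ")$")

# Row of `age` holding the first match for each pattern; NA if none matches
idx <- vapply(patterns, function(p) grep(p, age$key)[1], integer(1),
              USE.NAMES = FALSE)

# Keep every row of `name` and attach the looked-up age
res <- name
res$age <- age$age[idx]
```

Taking only the first match per row is a design choice of this sketch; if several sub-keys of one row can appear in the age table, the selection rule would need to be decided explicitly.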
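Answer 1 (score: 1)
For each row, I would use the stri_split_fixed function from the stringi package to split each compound key, and then try to match one of the resulting keys against the second data set.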
library(stringi)
res <- name
keys <- stri_split_fixed(name$key, ",") # returns a list of individual keys in each row
res$age <- sapply(seq_len(nrow(name)), function(r) {
  row_keys <- keys[[r]] # get the keys in the r-th row
  age$age[which(age$key %in% row_keys)]
})
This gives the result you asked for.
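The split-then-match idea can also be sketched in base R, which makes the edge cases visible. This is an illustrative variant, not the answer's stringi code: it uses strsplit and match, takes the first sub-key that is found, and yields NA for a row without any match. The extra row "x,y" / "nobody" is hypothetical and was added only to show the no-match case.

```r
# The question's tables plus one hypothetical row that has no match in `age`
name <- data.frame(key = c("a,b,c", "d", "e", "f,g", "x,y"),
                   name = c("jack", "daniel", "foo", "bar", "nobody"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# One character vector of sub-keys per row of `name`
keys <- strsplit(name$key, ",", fixed = TRUE)

first_age <- vapply(keys, function(k) {
  hit <- match(k, age$key)   # position of each sub-key in age$key, NA if absent
  hit <- hit[!is.na(hit)]    # keep only the sub-keys that were found
  if (length(hit) > 0) age$age[hit[1]] else NA_integer_
}, integer(1))

res <- name
res$age <- first_age
```

Using vapply with integer(1) guarantees a plain integer vector, whereas sapply would silently return a list if some row had zero or several matches.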
If the keys contain (or may contain) whitespace, a regex-based split is more appropriate:
stri_split_regex(name$key, ",\\p{Z}*")
or you can even extract the runs of word characters directly:
stri_extract_all_regex(name$key, "\\w+")
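For readers without stringi, roughly the same whitespace-tolerant split can be sketched in base R. The messy vector below is made-up sample input, not data from the question, and \\s matches ASCII-style whitespace rather than the full Unicode separator class \\p{Z} used above.

```r
# Hypothetical keys with stray spaces around the commas and at the ends
messy <- c("a, b,c", "d", " f ,g ")

# Trim the ends, then split on a comma together with any surrounding whitespace
keys <- strsplit(trimws(messy), "\\s*,\\s*")
```

Each element of keys is then a clean character vector of sub-keys that can be matched against the age table.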
Answer 2 (score: 1)
I don't mind using two joins:
library(plyr)
# factors to character vectors:
name <- as.data.frame(sapply(name, as.character), stringsAsFactors = FALSE)
# split comma-separated ids into a named list:
(tmp <- setNames(strsplit(name$key, ","), name$name))
# $jack
# [1] "a" "b" "c"
#
# $daniel
# [1] "d"
#
# $foo
# [1] "e"
#
# $bar
# [1] "f" "g"
# list to long 2-column data frame:
(tmp <- setNames(ldply(tmp, matrix), c("name", "key")))
#     name key
# 1   jack   a
# 2   jack   b
# 3   jack   c
# 4 daniel   d
# 5    foo   e
# 6    bar   f
# 7    bar   g
# join the long data frame with the age table (1st join) &
# add the original comma-separated key column back (2nd join);
# [-1] drops the single-letter key column left over from the 1st join
join(join(age, tmp, type = "inner"),
     name, by = "name")[-1]
#   age   name   key
# 1  13   jack a,b,c
# 2  21 daniel     d
# 3  24    foo     e
# 4 100    bar   f,g
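The same normalize-then-join idea can be sketched without plyr. This is a base R variant under stated assumptions, not the answer's code: it rebuilds the question's tables with character columns, uses merge for the inner join, and replaces the second join with a match lookup so the row order of name is preserved. The names pieces, long and matched are illustrative.

```r
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Normalize: one row per (name, single key) pair
pieces <- strsplit(name$key, ",", fixed = TRUE)
long <- data.frame(name = rep(name$name, lengths(pieces)),
                   key = unlist(pieces),
                   stringsAsFactors = FALSE)

# Inner join on the single key; unmatched keys such as "k" drop out
matched <- merge(long, age, by = "key")

# Look the age up by name so `name` keeps its rows, order, and compound keys
res <- name
res$age <- matched$age[match(res$name, matched$name)]
```

This assumes the name column is unique and that at most one sub-key per row appears in the age table; with several hits per name, match would silently keep only the first.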