What is the best way to join tables in R when the key is not normalized?

Asked: 2014-04-21 15:19:40

Tags: r join merge

Say I have two tables, name and age, like this:

> name
    key   name
1 a,b,c   jack
2     d daniel
3     e    foo
4   f,g    bar
> age
  key age
1   b  13
2   d  21
3   e  24
4   k  34
5   f 100

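I am trying to join these two tables on the key column, which exists in both. The challenge is that the key column in the name table is not normalized: a single cell can hold several comma-separated keys. What is the best way to combine the two tables so that every row of the name table is present and kept as-is in the joined table (like the "res" table below)?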

> res
    key   name age
1 a,b,c   jack  13
2     d daniel  21
3     e    foo  24
4   f,g    bar 100

Here is the dput output needed to reproduce the tables:

> dput(name)

structure(list(key = structure(1:4, .Label = c("a,b,c", "d", 
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L, 
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor")), .Names = c("key", 
"name"), class = "data.frame", row.names = c(NA, -4L))

> dput(age)

structure(list(key = structure(c(1L, 2L, 3L, 5L, 4L), .Label = c("b", 
"d", "e", "f", "k"), class = "factor"), age = c(13L, 21L, 24L, 
34L, 100L)), .Names = c("key", "age"), class = "data.frame", row.names = c(NA, 
-5L))

> dput(res)

structure(list(key = structure(1:4, .Label = c("a,b,c", "d", 
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L, 
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor"), 
    age = c(13L, 21L, 24L, 100L)), .Names = c("key", "name", 
"age"), class = "data.frame", row.names = c(NA, -4L))

3 answers:

Answer 0 (score: 3)

Perhaps you can coerce the "key" column of the "name" data.frame into a regular-expression pattern and use sapply, like this:

sapply(gsub(",", "|", name$key), function(x) grep(x, age$key))
# a|b|c     d     e   f|g 
#     1     2     3     5 

The above basically returns the row numbers of the "age" data.frame where a match is found, in the order in which they are found.

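You can then use this information to extract the "age" values from the "age" data.frame with basic [row, col] extraction, as shown below, and assign the result to a new column such as name$age: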

age[sapply(gsub(",", "|", name$key), function(x) grep(x, age$key)), "age"]
# [1]  13  21  24 100
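One caveat with the pattern approach is that an unanchored pattern such as a|b|c also matches longer keys that merely contain one of those letters, and a row with no match returns a zero-length result. The following is a minimal sketch of a stricter variant, assuming keys contain no regex metacharacters; the names patterns and idx are illustrative and not part of the original answer.

```r
# Sample data mirroring the question (character columns instead of factors)
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Turn "a,b,c" into "^(a|b|c)$" so that only whole keys match
patterns <- paste0("^(", gsub(",", "|", name$key, fixed = TRUE), ")$")

# Take the first matching row of `age`; indexing an empty result with [1]
# yields NA, so rows without a match stay in the output
idx <- vapply(patterns, function(p) grep(p, age$key)[1], integer(1),
              USE.NAMES = FALSE)

name$age <- age$age[idx]
name
```

Using vapply with integer(1) guarantees one index per row of name, so the assignment cannot silently change length.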

Answer 1 (score: 1)

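For each row, I would split the compound key with the stri_split_fixed function from the stringi package and then try to match one of the resulting keys against the second dataset.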

library(stringi)
res <- name
keys <- stri_split_fixed(name$key, ",") # returns a list of individual keys in each row
res$age <- sapply(1:nrow(name), function(r) {
   keys <- keys[[r]] # get the keys in rth row
   age$age[which(age$key %in% keys)]
})

This gives the result you asked for.
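The same split-and-look-up idea can be sketched in base R, with explicit handling for rows that have no match at all. This is only an illustration under stated assumptions: it swaps stri_split_fixed for base strsplit, keeps the first match when several keys in a row match, and adds a hypothetical fifth row ("x", "nobody") that is not in the question to show the no-match case.

```r
# Question data plus one hypothetical row ("x") that has no match in `age`
name <- data.frame(key = c("a,b,c", "d", "e", "f,g", "x"),
                   name = c("jack", "daniel", "foo", "bar", "nobody"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# One character vector of individual keys per row of `name`
keys <- strsplit(name$key, ",", fixed = TRUE)

res <- name
res$age <- vapply(keys, function(k) {
  hit <- match(k, age$key)   # position in `age` for each key, NA if absent
  hit <- hit[!is.na(hit)]
  # Keep the first match; return NA when none of the keys is found
  if (length(hit) > 0) age$age[hit[1]] else NA_integer_
}, integer(1))
res
```

Returning NA_integer_ for unmatched rows keeps res$age an ordinary integer column rather than a list.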

If the keys contain (or may contain) whitespace, splitting on a regular expression is more appropriate:

stri_split_regex(name$key, ",\\p{Z}*")

or you can even extract the character sequences directly:

stri_extract_all_regex(name$key, "\\w+")
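For readers without stringi, here is a rough base R sketch of both ideas. The vector messy is hypothetical sample input, and the split pattern used here also tolerates spaces before the comma, which goes slightly beyond the stringi pattern shown above.

```r
# Hypothetical keys with stray spaces around the commas
messy <- c("a, b,c", "d", "f ,g")

# Split on a comma together with any surrounding whitespace
by_split <- strsplit(messy, "\\s*,\\s*")

# Or pull out every run of word characters
by_extract <- regmatches(messy, gregexpr("\\w+", messy))

by_split
by_extract
```

Both calls return a list with one character vector of clean keys per input element.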

Answer 2 (score: 1)

I would not mind using two joins:

library(plyr)
# factors to character vectors:
name <- as.data.frame(sapply(name, as.character), stringsAsFactors = FALSE)

# split comma-separated ids into named list:
(tmp <- setNames(strsplit(name$key, ","), name$name))
# $jack
# [1] "a" "b" "c"
# 
# $daniel
# [1] "d"
# 
# $foo
# [1] "e"
# 
# $bar
# [1] "f" "g"

# list to long 2-column data frame:
(tmp <- setNames(ldply(tmp, matrix), c("name", "key")) )
#     name key
# 1   jack   a
# 2   jack   b
# 3   jack   c
# 4 daniel   d
# 5    foo   e
# 6    bar   f
# 7    bar   g

# join the long data frame (tmp) with the age table (1st join) &
# add the original comma-separated key column (2nd join);
# [-1] drops the single-key column left over from the 1st join
join(join(age, tmp, type="inner"),
     name, by="name")[-1] 
#   age   name   key
# 1  13   jack a,b,c
# 2  21 daniel     d
# 3  24    foo     e
# 4 100    bar   f,g
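The normalize-then-join idea can also be sketched with base R only. This is an illustrative variant rather than the answer's own code: it uses merge instead of plyr's join, tracks the source row number instead of joining back on name, and assumes R 3.2.0 or later for lengths. The names parts, long and matched are made up for the example.

```r
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Normalize: one row per individual key, remembering the source row
parts <- strsplit(name$key, ",", fixed = TRUE)
long <- data.frame(row = rep(seq_along(parts), lengths(parts)),
                   key = unlist(parts),
                   stringsAsFactors = FALSE)

# Inner join on the individual keys; unmatched keys (a, c, g, k) drop out
matched <- merge(long, age, by = "key")

# Map the ages back onto the untouched original rows;
# match() keeps the first hit per row and gives NA when a row has none
res <- name
res$age <- matched$age[match(seq_len(nrow(name)), matched$row)]
res
```

Keying the second step on the row number avoids relying on name being unique, which the two-join version implicitly assumes.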