Extracting unique strings from a factor variable of strings

Time: 2015-06-25 21:56:05

Tags: r unique categorical-data

I have a factor variable in which each element is a comma-separated list of actor names.

(actor=structure(c(4L, 1L, 6L, 2L, 5L, 3L), .Label = c("Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman", 
"Jamie Foxx, Christoph Waltz, Leonardo DiCaprio, Kerry Washington", 
"Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Stanley Tucci", 
"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Ken Watanabe", 
"Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley, Max von Sydow", 
"Robert Downey Jr., Chris Evans, Scarlett Johansson, Jeremy Renner"
), class = "factor"))
# [1] Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Ken Watanabe
# [2] Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman            
# [3] Robert Downey Jr., Chris Evans, Scarlett Johansson, Jeremy Renner
# [4] Jamie Foxx, Christoph Waltz, Leonardo DiCaprio, Kerry Washington 
# [5] Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley, Max von Sydow     
# [6] Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Stanley Tucci
# 6 Levels: Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman ...

I want to extract every complete actor name (first name + surname) from it and build an output matrix from those names.

1 Answer:

Answer 0: (score: 3)

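If you want the unique actor names, convert the factor to character with as.character, split each string on the comma separator with strsplit, combine all the vectors in the resulting list with unlist, and keep one copy of each name with unique: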

(all.actors <- unique(unlist(strsplit(as.character(actor), ", "))))
#  [1] "Leonardo DiCaprio"    "Joseph Gordon-Levitt" "Ellen Page"           "Ken Watanabe"        
#  [5] "Christian Bale"       "Tom Hardy"            "Anne Hathaway"        "Gary Oldman"         
#  [9] "Robert Downey Jr."    "Chris Evans"          "Scarlett Johansson"   "Jeremy Renner"       
# [13] "Jamie Foxx"           "Christoph Waltz"      "Kerry Washington"     "Mark Ruffalo"        
# [17] "Ben Kingsley"         "Max von Sydow"        "Jennifer Lawrence"    "Josh Hutcherson"     
# [21] "Liam Hemsworth"       "Stanley Tucci"    
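
To make each stage of that pipeline visible, here is a minimal sketch that runs the same four steps one at a time. The two-element factor `x` and the intermediate names (`chr`, `parts`, `flat`, `uniq`) are hypothetical and exist only for illustration. `fixed = TRUE` is an optional addition that treats the separator as a literal string rather than a regular expression.

```r
# Hypothetical two-element factor, used only for illustration
x <- factor(c("Tom Hardy, Gary Oldman", "Gary Oldman, Ellen Page"))

# Step 1: factor -> character vector (strsplit does not accept a factor)
chr <- as.character(x)

# Step 2: split each string on ", "; the result is a list with one
# character vector per element of x
parts <- strsplit(chr, ", ", fixed = TRUE)

# Step 3: flatten the list into a single character vector (duplicates kept)
flat <- unlist(parts)

# Step 4: drop duplicates, keeping the order of first appearance
uniq <- unique(flat)
uniq
```

Here "Gary Oldman" occurs in both elements, so `flat` has four entries while `uniq` has three.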

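Because it uses as.character(actor), this code picks up only the actors that actually appear in the factor actor, even if the factor has additional, unused levels. If you use levels(actor) instead, you get every actor named in the factor levels, whether or not that level is used in actor. When defining all.actors, use whichever of the two matches what you need.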
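
The difference only shows up when a factor carries unused levels, for example after subsetting. The sketch below assumes a hypothetical three-element factor `actor.demo` (not the data from the question) and drops its last element so that one level becomes unused.

```r
# Hypothetical three-element factor, for illustration only
actor.demo <- factor(c("Tom Hardy, Gary Oldman",
                       "Ellen Page, Ken Watanabe",
                       "Chris Evans, Jeremy Renner"))

# Subsetting keeps all three levels, so the third one is now unused
sub <- actor.demo[1:2]

# Only the actors that actually occur in sub
used.actors <- unique(unlist(strsplit(as.character(sub), ", ")))

# Every actor named in any level, used or not
level.actors <- unique(unlist(strsplit(levels(sub), ", ")))

# Actors that exist only in unused levels
setdiff(level.actors, used.actors)
```

Calling `droplevels(sub)` first removes the unused level, after which both approaches return the same set of names, although possibly in a different order, because levels are sorted while `as.character` follows the order of the data.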

If you want a matrix that indicates, for each actor, which elements of actor contain that actor, you can do:

mat <- sapply(strsplit(as.character(actor), ", "), function(x) all.actors %in% x)
row.names(mat) <- all.actors
mat
#                       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
# Leonardo DiCaprio     TRUE FALSE FALSE  TRUE  TRUE FALSE
# Joseph Gordon-Levitt  TRUE FALSE FALSE FALSE FALSE FALSE
# Ellen Page            TRUE FALSE FALSE FALSE FALSE FALSE
# Ken Watanabe          TRUE FALSE FALSE FALSE FALSE FALSE
# Christian Bale       FALSE  TRUE FALSE FALSE FALSE FALSE
# Tom Hardy            FALSE  TRUE FALSE FALSE FALSE FALSE
# Anne Hathaway        FALSE  TRUE FALSE FALSE FALSE FALSE
# Gary Oldman          FALSE  TRUE FALSE FALSE FALSE FALSE
# Robert Downey Jr.    FALSE FALSE  TRUE FALSE FALSE FALSE
# Chris Evans          FALSE FALSE  TRUE FALSE FALSE FALSE
# Scarlett Johansson   FALSE FALSE  TRUE FALSE FALSE FALSE
# Jeremy Renner        FALSE FALSE  TRUE FALSE FALSE FALSE
# Jamie Foxx           FALSE FALSE FALSE  TRUE FALSE FALSE
# Christoph Waltz      FALSE FALSE FALSE  TRUE FALSE FALSE
# Kerry Washington     FALSE FALSE FALSE  TRUE FALSE FALSE
# Mark Ruffalo         FALSE FALSE FALSE FALSE  TRUE FALSE
# Ben Kingsley         FALSE FALSE FALSE FALSE  TRUE FALSE
# Max von Sydow        FALSE FALSE FALSE FALSE  TRUE FALSE
# Jennifer Lawrence    FALSE FALSE FALSE FALSE FALSE  TRUE
# Josh Hutcherson      FALSE FALSE FALSE FALSE FALSE  TRUE
# Liam Hemsworth       FALSE FALSE FALSE FALSE FALSE  TRUE
# Stanley Tucci        FALSE FALSE FALSE FALSE FALSE  TRUE
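
Once such a logical matrix exists, a few base R calls summarise it. The following is a sketch under stated assumptions: the three-element factor `films` and the names `casts`, `all.names`, `m`, `appearances`, `repeated`, and `by.film` are hypothetical, and `vapply` is swapped in for `sapply` only because it guarantees a logical matrix of the expected shape.

```r
# Hypothetical three-element factor, for illustration only
films <- factor(c("Leonardo DiCaprio, Ellen Page",
                  "Tom Hardy, Gary Oldman",
                  "Leonardo DiCaprio, Mark Ruffalo"))

casts <- strsplit(as.character(films), ", ", fixed = TRUE)
all.names <- unique(unlist(casts))

# One column per element of films, one row per actor
m <- vapply(casts, function(x) all.names %in% x, logical(length(all.names)))
rownames(m) <- all.names

# Number of elements in which each actor appears (TRUE counts as 1)
appearances <- rowSums(m)

# Actors that appear in more than one element
repeated <- names(appearances)[appearances > 1]

# Transposed layout: one row per element, one column per actor
by.film <- t(m)
```

The transposed form is often more convenient when the indicator columns are to be bound to a data frame that has one row per element of the original factor.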