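Efficient way to split a vector of full names into 2 separate vectors

Time: 2016-07-22 15:45:38

Tags: r string strstr

I have a vector of full names in which the last and first names are separated by a comma. This is what the first few elements look like: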

> head(val.vec)
[1] "Aabye,ֲ Edgar"        "Aaltonen,ֲ Arvo"      "Aaltonen,ֲ Paavo"    
[4] "Aalvik Grimsb,ֲ Kari" "Aamodt,ֲ Kjetil Andr" "Aamodt,ֲ Ragnhild"

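I am looking for a way to split them into 2 separate columns, one for the first name and one for the last name. My end goal is to have both as part of a larger data frame.

I tried using the strsplit function like this: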

names<-unlist(strsplit(val.vec,','))

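But it gives me one long vector instead of 2 separate sets. I know it is possible to use a loop, go over all the elements, and put the first and last names into 2 separate vectors, but that is rather time consuming given that there are about 25,000 records.

I saw a few similar questions, but the discussion there was about how to do this in C++ and Java.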

4 Answers:

Answer 0: (score: 5)

We can use read.csv to convert the vector into a data.frame with 2 columns:

read.csv(text=val.vec, header=FALSE, stringsAsFactors=FALSE)

Or, if we are using strsplit, instead of calling unlist (which converts the whole list into a single vector), we can extract the first and second elements of each list element separately to create two vectors ('v1' and 'v2').

lst <- strsplit(val.vec,',')
# sapply returns character vectors; lapply would return lists instead
v1 <- sapply(lst, `[`, 1)
v2 <- sapply(lst, `[`, 2)

Yet another option is sub:

v1 <- sub(",.*", "", val.vec)
v2 <- sub("[^,]+,", "", val.vec)

Data

val.vec <- c("Aabye,ֲ Edgar", "Aaltonen,ֲ Arvo", "Aaltonen,ֲ Paavo", 
        "Aalvik Grimsb,ֲ Kari", "Aamodt,ֲ Kjetil Andr", "Aamodt,ֲ Ragnhild")
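
Since the stated goal is two columns in a data frame, the strsplit approach can be sketched end to end roughly as follows. This is only an illustration: the names full_names and name_df are hypothetical, and the sample data assumes an ordinary space after the comma rather than the odd character in the original vector.

```r
# Hypothetical sample data with an ordinary space after each comma.
full_names <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aalvik Grimsb, Kari")

# Split each element on the literal comma; the result is a list of character vectors.
parts <- strsplit(full_names, ",", fixed = TRUE)

# vapply() guarantees a character vector result; an entry without a comma
# would yield NA for the first name instead of shifting later values.
last_name  <- vapply(parts, `[`, character(1), 1)
first_name <- trimws(vapply(parts, `[`, character(1), 2))

name_df <- data.frame(last_name, first_name, stringsAsFactors = FALSE)
print(name_df)
```

Because both columns are built from the same list, they stay aligned row by row, so they can be added to a larger data frame with cbind() or plain column assignment.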


Answer 1: (score: 2)

Another option:

library(stringi)
stri_split_fixed(val.vec, ",", simplify = TRUE)

which gives:

#     [,1]            [,2]          
#[1,] "Aabye"         "ֲ Edgar"      
#[2,] "Aaltonen"      "ֲ Arvo"       
#[3,] "Aaltonen"      "ֲ Paavo"      
#[4,] "Aalvik Grimsb" "ֲ Kari"       
#[5,] "Aamodt"        "ֲ Kjetil Andr"
#[6,] "Aamodt"        "ֲ Ragnhild"  

If you want the result in a data.frame, you can wrap it in as.data.frame().
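
As a rough sketch of that last step, the matrix can be converted, given column names, and trimmed. The sample vector x and the column names last_name and first_name are assumptions made for illustration; they are not part of the answer.

```r
library(stringi)

# Hypothetical sample data with an ordinary space after each comma.
x <- c("Aabye, Edgar", "Aaltonen, Arvo")

# simplify = TRUE returns a character matrix with one row per input string.
m <- stri_split_fixed(x, ",", simplify = TRUE)

# Convert to a data.frame and replace the default V1/V2 column names.
name_df <- as.data.frame(m, stringsAsFactors = FALSE)
names(name_df) <- c("last_name", "first_name")

# Remove the whitespace left over after the comma.
name_df$first_name <- stri_trim_both(name_df$first_name)
print(name_df)
```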

Answer 2: (score: 1)

Simply wrap your function call in a call to sapply:

val.vec = c("Aabye,ֲ Edgar", "Aaltonen,ֲ Arvo", "Aaltonen,ֲ Paavo", "Aalvik Grimsb,ֲ Kari", "Aamodt,ֲ Kjetil Andr", "Aamodt,ֲ Ragnhild")

names = t(sapply(val.vec, function(x) unlist(strsplit(x,','))))
names

#> names
#                     [,1]            [,2]           
#Aabye,? Edgar        "Aabye"         "? Edgar"      
#Aaltonen,? Arvo      "Aaltonen"      "? Arvo"       
#Aaltonen,? Paavo     "Aaltonen"      "? Paavo"      
#Aalvik Grimsb,? Kari "Aalvik Grimsb" "? Kari"       
#Aamodt,? Kjetil Andr "Aamodt"        "? Kjetil Andr"
#Aamodt,? Ragnhild    "Aamodt"        "? Ragnhild"  

Alternatively, using the solution you already tried, we can coerce the result into two columns.

val.vec = c("Aabye,ֲ Edgar", "Aaltonen,ֲ Arvo", "Aaltonen,ֲ Paavo", "Aalvik Grimsb,ֲ Kari", "Aamodt,ֲ Kjetil Andr", "Aamodt,ֲ Ragnhild")
names = matrix(unlist(strsplit(val.vec,',')), ncol = 2L, byrow = TRUE)
#> names
#     [,1]            [,2]           
#[1,] "Aabye"         "? Edgar"      
#[2,] "Aaltonen"      "? Arvo"       
#[3,] "Aaltonen"      "? Paavo"      
#[4,] "Aalvik Grimsb" "? Kari"       
#[5,] "Aamodt"        "? Kjetil Andr"
#[6,] "Aamodt"        "? Ragnhild"   

Benchmarking against the (very fast) solution proposed by Richard Scriven, we can see that yours and his perform essentially the same:

#> library(microbenchmark)
#> microbenchmark(
#+   names_1 = do.call(rbind, strsplit(val.vec, ",")),
#+   names_2 = matrix(unlist(strsplit(val.vec,',')), ncol = 2L, byrow = TRUE),
#+   times = 10000L
#+ )
#Unit: microseconds
#    expr    min     lq     mean median     uq      max neval cld
# names_1 12.596 13.530 15.08867 13.996 14.463  513.185 10000   b
# names_2 11.663 12.131 14.03413 12.597 13.530 1436.917 10000  a 

Answer 3: (score: 0)

If you like the dplyr way of doing things, have a look at separate from the tidyr package:

library(dplyr)
library(tidyr)

dat = data.frame(val = c("Lee, John", "Lee, Spike", "Doe, John", 
        "Longstocking, Pippy", "Bond, James", "Jordan, Michael"))
#                   val
# 1           Lee, John
# 2          Lee, Spike
# 3           Doe, John
# 4 Longstocking, Pippy
# 5         Bond, James
# 6     Jordan, Michael
dat %>% 
  separate(val, c('last_name', 'first_name'), sep = ',') %>% 
  mutate(first_name = trimws(first_name))
#      last_name first_name
# 1          Lee       John
# 2          Lee      Spike
# 3          Doe       John
# 4 Longstocking      Pippy
# 5         Bond      James
# 6       Jordan    Michael

The call to trimws was added to get rid of the leading whitespace.
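
A possible variation, sketched here under the assumption that the separator is always a comma followed by optional ordinary whitespace: separate treats a character sep as a regular expression, so the whitespace can be consumed during the split itself. The small data frame below is a hypothetical subset of the example data.

```r
library(tidyr)

# Hypothetical subset of the example data.
dat <- data.frame(val = c("Lee, John", "Longstocking, Pippy"),
                  stringsAsFactors = FALSE)

# ",\\s*" matches the comma plus any whitespace that follows it,
# so no separate trimming step is needed afterwards.
res <- separate(dat, val, c("last_name", "first_name"), sep = ",\\s*")
print(res)
```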