Extract the home and away teams separated by " at" in R

Date: 2019-03-29 01:35:51

Tags: r

I have a vector of college basketball matchups:

c("#34 Colorado  at  #36 California", "#31 Utah  at  #87 Stanford", 
"#26 USC  at  #112 Wash State", "#56 UCLA  at  #134 Washington", 
"#187 W Illinois  at  #116 Neb Omaha", "#222 Denver  at  #58 S Dakota St", 
"#245 IUPUI  at  #170 South Dakota", "#268 Rice  at  #208 TX El Paso", 
"#274 North Texas  at  #344 TX-San Ant", "#14 Iowa  at  #3 Purdue"
)

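I want two separate vectors: one for the teams before "at" and another for the teams after "at". For example, the first vector would contain Colorado, Utah, USC, etc., and the second would contain California, Stanford, Wash State, and so on.

Note that I do not want the # rankings; I only want the team names. I tried str_split, but it did not work well because the spacing is inconsistent.

3 Answers:

Answer 0: (score: 1)

We can use strsplit to split each string at "at", which gives two parts per string, then remove the "#" followed by digits from each part, trim the whitespace, and put the result in a data frame. Here string is the character vector from the question.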

data.frame(t(sapply(strsplit(string, "\\bat\\b"), 
             function(x) trimws(sub("#[0-9]+", "", x)))))


#            X1           X2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue

Or use tidyr::separate

# Splitting on "\\s+at\\s+" consumes the padding around "at", so no stray
# whitespace is left in T1 or T2
tidyr::separate(data.frame(col = trimws(gsub("#[0-9]+", "", string))),
        col, into = c("T1", "T2"), sep = "\\s+at\\s+")


#            T1           T2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
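
The question asks for two separate vectors rather than a data frame, so the same split-then-clean idea can be sketched in base R as below. The names matchups, strip_rank, away and home are illustrative assumptions, and only a few of the original strings are repeated to keep the example short.

```r
# A few matchups from the question (the name `matchups` is an assumption)
matchups <- c("#34 Colorado  at  #36 California", "#31 Utah  at  #87 Stanford",
              "#26 USC  at  #112 Wash State", "#14 Iowa  at  #3 Purdue")

# Split on "at" together with the whitespace around it
parts <- strsplit(matchups, "\\s+at\\s+")

# Hypothetical helper: drop the leading "#<rank>" and the whitespace after it
strip_rank <- function(x) sub("^#[0-9]+\\s*", "", x)

away <- strip_rank(vapply(parts, `[`, character(1), 1))  # teams before "at"
home <- strip_rank(vapply(parts, `[`, character(1), 2))  # teams after "at"
```

Requiring whitespace on both sides of "at" keeps names such as "Wash State" intact, since the "at" inside "State" is not preceded by a space.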

Answer 1: (score: 1)

Another solution, using stringr's string-extraction functions with lookaround patterns

df <- data.frame(stringsAsFactors = FALSE,
                 text = c("#34 Colorado  at  #36 California", "#31 Utah  at  #87 Stanford", 
                          "#26 USC  at  #112 Wash State", "#56 UCLA  at  #134 Washington", 
                          "#187 W Illinois  at  #116 Neb Omaha", "#222 Denver  at  #58 S Dakota St", 
                          "#245 IUPUI  at  #170 South Dakota", "#268 Rice  at  #208 TX El Paso", 
                          "#274 North Texas  at  #344 TX-San Ant", "#14 Iowa  at  #3 Purdue")
)

library(stringr)
library(dplyr)

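df %>% 
    # str_extract() returns a plain character vector (str_extract_all() would
    # create list-columns); the lazy .+? stops before the whitespace preceding "at"
    mutate(team_a = str_extract(text, "(?<=\\s).+?(?=\\s+at)"),
           team_b = str_extract(text, "(?<=\\d\\s)[^\\d]+$"))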
#>                                     text       team_a       team_b
#> 1       #34 Colorado  at  #36 California    Colorado    California
#> 2             #31 Utah  at  #87 Stanford        Utah      Stanford
#> 3           #26 USC  at  #112 Wash State         USC    Wash State
#> 4          #56 UCLA  at  #134 Washington        UCLA    Washington
#> 5    #187 W Illinois  at  #116 Neb Omaha  W Illinois     Neb Omaha
#> 6       #222 Denver  at  #58 S Dakota St      Denver   S Dakota St
#> 7      #245 IUPUI  at  #170 South Dakota       IUPUI  South Dakota
#> 8         #268 Rice  at  #208 TX El Paso        Rice    TX El Paso
#> 9  #274 North Texas  at  #344 TX-San Ant North Texas    TX-San Ant
#> 10               #14 Iowa  at  #3 Purdue        Iowa        Purdue

Created on 2019-03-29 by the reprex package (v0.2.1)
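
The two lookaround patterns can also be folded into a single pattern with capture groups. The following is a minimal base R sketch, assuming every string has the form "#rank team at #rank team"; the names x, pattern, team_a and team_b are chosen for illustration only.

```r
x <- c("#187 W Illinois  at  #116 Neb Omaha",
       "#274 North Texas  at  #344 TX-San Ant")

# "#<rank>", first team (lazy), "at" with surrounding whitespace,
# "#<rank>", second team up to the end of the string
pattern <- "^#\\d+\\s+(.+?)\\s+at\\s+#\\d+\\s+(.+)$"

# perl = TRUE is needed for the lazy quantifier
team_a <- sub(pattern, "\\1", x, perl = TRUE)
team_b <- sub(pattern, "\\2", x, perl = TRUE)
```

If a string does not match the pattern, sub() returns it unchanged, so unexpected inputs are easy to spot in the result.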

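Answer 2: (score: 0)

We can do this in base R: replace the " at " separator in the "text" column with a comma, remove the "#" rankings, and then parse the resulting comma-separated text with read.csv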

# strip.white = TRUE drops the space left after the comma once "#<rank>" is removed
read.csv(text = trimws(gsub("#\\d+", "", gsub("\\s+at\\s+", ",", df$text))),
        header = FALSE, col.names = c("T1", "T2"), stringsAsFactors = FALSE,
        strip.white = TRUE)
#            T1           T2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
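
Right-aligned printing hides leading spaces, so it can be worth checking the parsed names directly. By default read.csv keeps whitespace around character fields, which means a space left after the comma ends up in the team name unless strip.white = TRUE is set. A small sketch, where has_stray_space is a hypothetical helper and the two-line input is made up for illustration:

```r
# Hypothetical helper: TRUE if any element starts or ends with whitespace
has_stray_space <- function(x) any(grepl("^\\s|\\s$", x))

csv_text <- "Colorado, California\nUtah, Stanford"

raw <- read.csv(text = csv_text, header = FALSE,
                col.names = c("T1", "T2"), stringsAsFactors = FALSE)
clean <- read.csv(text = csv_text, header = FALSE,
                  col.names = c("T1", "T2"), stringsAsFactors = FALSE,
                  strip.white = TRUE)

has_stray_space(raw$T2)    # TRUE: " California" keeps its leading space
has_stray_space(clean$T2)  # FALSE
```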