I have a vector of matchups from college basketball games:
c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
"#26 USC at #112 Wash State", "#56 UCLA at #134 Washington",
"#187 W Illinois at #116 Neb Omaha", "#222 Denver at #58 S Dakota St",
"#245 IUPUI at #170 South Dakota", "#268 Rice at #208 TX El Paso",
"#274 North Texas at #344 TX-San Ant", "#14 Iowa at #3 Purdue"
)
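I want two separate vectors: one for the teams before "at" and one for the teams after "at". For example, the first vector would contain Colorado, Utah, USC, and so on, and the second would contain California, Stanford, Wash State, and so on.
Note that I don't want the # rankings, only the team names. I tried str_split, but it didn't work well because the spacing is inconsistent.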
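Answer 0 (score: 1)
We can use strsplit and split on "at", which gives two parts per string. We then remove the "#" and the digits that follow it from each part, and put the result in a data frame.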
# `string` is the character vector of matchups from the question
data.frame(t(sapply(strsplit(string, "\\bat\\b"),
                    function(x) trimws(sub("#[0-9]+", "", x)))))
#            X1           X2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
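Since the question asks for two separate vectors rather than a data frame, the same split-then-strip idea can be sketched in base R as below. This is only an illustration: the names matchups, strip_rank, before_at and after_at are made up for the example, and only a few of the strings from the question are used.

```r
# A few matchup strings copied from the question; `matchups` is an illustrative name.
matchups <- c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
              "#26 USC at #112 Wash State", "#14 Iowa at #3 Purdue")

# Split each string on the standalone word "at", consuming the whitespace around it.
parts <- strsplit(matchups, "\\s+at\\s+")

# Hypothetical helper: drop the leading "#<rank>" from a piece, keeping the team name.
strip_rank <- function(x) trimws(sub("^#[0-9]+\\s*", "", x))

# The first piece of each split is the team before "at", the second the team after.
before_at <- strip_rank(vapply(parts, `[`, character(1), 1))
after_at  <- strip_rank(vapply(parts, `[`, character(1), 2))

before_at
after_at
```

Using vapply with character(1) makes the code fail loudly if a piece is not a single string, instead of silently returning a list.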
Or use tidyr::separate:
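# Strip the rankings (and the space after them) first, then split on "at"
# together with its surrounding whitespace so neither column keeps stray spaces
tidyr::separate(data.frame(col = trimws(gsub("#[0-9]+\\s*", "", string))),
                col, into = c("T1", "T2"), sep = "\\s+at\\s+")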
#            T1           T2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
Answer 1 (score: 1)
Another solution, using str_extract_all():
df <- data.frame(stringsAsFactors = FALSE,
  text = c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
           "#26 USC at #112 Wash State", "#56 UCLA at #134 Washington",
           "#187 W Illinois at #116 Neb Omaha", "#222 Denver at #58 S Dakota St",
           "#245 IUPUI at #170 South Dakota", "#268 Rice at #208 TX El Paso",
           "#274 North Texas at #344 TX-San Ant", "#14 Iowa at #3 Purdue")
)
library(stringr)
library(dplyr)
# team_a: everything after the first whitespace, up to the whitespace before "at"
# team_b: the trailing run of non-digits that follows "<digit><space>"
# Note: str_extract_all() returns a list, so team_a and team_b are list columns
df %>%
  mutate(team_a = str_extract_all(text, "(?<=\\s).+(?=\\s+at)"),
         team_b = str_extract_all(text, "(?<=\\d\\s)[^\\d]+$"))
#>                                   text      team_a       team_b
#> 1       #34 Colorado at #36 California    Colorado   California
#> 2             #31 Utah at #87 Stanford        Utah     Stanford
#> 3           #26 USC at #112 Wash State         USC   Wash State
#> 4          #56 UCLA at #134 Washington        UCLA   Washington
#> 5    #187 W Illinois at #116 Neb Omaha  W Illinois    Neb Omaha
#> 6       #222 Denver at #58 S Dakota St      Denver  S Dakota St
#> 7      #245 IUPUI at #170 South Dakota       IUPUI South Dakota
#> 8         #268 Rice at #208 TX El Paso        Rice   TX El Paso
#> 9  #274 North Texas at #344 TX-San Ant North Texas   TX-San Ant
#> 10               #14 Iowa at #3 Purdue        Iowa       Purdue
Created on 2019-03-29 by the reprex package (v0.2.1)
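Answer 2 (score: 0)
We can do this in base R by removing the unwanted substrings from the "text" column, turning "at" into a comma, and then reading the result with read.csv: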
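# Replace " at " with a comma, then drop each "#<rank>" and the space after it,
# so the second field does not start with a stray space
read.csv(text = trimws(gsub("#\\d+\\s*", "", gsub("\\s+at\\s+", ",", df$text))),
         header = FALSE, col.names = c("T1", "T2"), stringsAsFactors = FALSE)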
#            T1           T2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
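As a variation that none of the answers above uses, the whole string can also be described with one regular expression and the two team names pulled out with backreferences. The sketch below assumes every string has the exact shape "#rank team at #rank team"; the names matchups, pattern, team_1 and team_2 are illustrative.

```r
# A few matchup strings copied from the question; `matchups` is an illustrative name.
matchups <- c("#187 W Illinois at #116 Neb Omaha",
              "#274 North Texas at #344 TX-San Ant",
              "#14 Iowa at #3 Purdue")

# Whole-string pattern: rank, team (captured lazily), "at", rank, team (captured).
pattern <- "^#\\d+\\s+(.+?)\\s+at\\s+#\\d+\\s+(.+)$"

# Replace each full match with the first or the second captured group.
team_1 <- sub(pattern, "\\1", matchups, perl = TRUE)
team_2 <- sub(pattern, "\\2", matchups, perl = TRUE)

team_1
team_2
```

One caveat with this approach: sub() returns a string unchanged when the pattern does not match, so malformed input passes through silently and may be worth checking with grepl(pattern, matchups, perl = TRUE) first.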