Converting a vector of strings to a data frame in R

Time: 2017-10-26 19:02:42

Tags: r string dataframe data-manipulation

I'm working on a quick scraping project that involves grabbing historical NFL football data. Here's a quick look at my data:

allgames_thisweek = c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score", "Cleveland Browns 28, Cincinnati Bengals 20 -- Box Score", 
"Dallas Cowboys 26, Pittsburgh Steelers 9 -- Box Score", "Detroit Lions 31, Atlanta Falcons 28 (OT)  -- Box Score", 
"Green Bay Packers 16, Minnesota Vikings 10 -- Box Score", "Indianapolis Colts 45, Houston Oilers 21 -- Box Score", 
"Kansas City Chiefs 30, New Orleans Saints 17 -- Box Score", 
"Los Angeles Rams 14, Arizona Cardinals 12 -- Box Score", "Miami Dolphins 39, New England Patriots 35 -- Box Score", 
"New York Giants 28, Philadelphia Eagles 23 -- Box Score", "New York Jets 23, Buffalo Bills 3 -- Box Score", 
"San Diego Chargers 37, Denver Broncos 34 -- Box Score", "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score", 
"Seattle Seahawks 28, Washington Redskins 7 -- Box Score")

allgames_thisweek[1]
"Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score"

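Each row contains the following data: [team1, team1score, team2, team2score, --, Box Score]

My data is formatted exactly the same way throughout, meaning there is always a comma after the first team's score and always a -- after the second team's score. I would like to create a data frame with 4 columns (team1, team1score, team2, team2score), so the output might look like this: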

output_df
            team1    team1score                  team2   team2score
1.  Chicago Bears            21   Tampa Bay Buccaneers            9

Any ideas on how to achieve this? Any help is appreciated! Thanks.

1 Answer:

Answer 0: (score: 6)

You can do this with dplyr + stringr:

library(dplyr)
library(stringr)

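string %>%
  # drop everything after the second score (" -- Box Score", plus any "(OT)")
  str_replace("(?<=\\d)\\s.*--.+$", "") %>%
  # turn the space in front of each score into a comma
  str_replace_all("\\s(?=\\d+\\b)", ",") %>%
  # split on commas, swallowing the space that follows the first score
  strsplit(",\\s*") %>%
  do.call(rbind, .) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(c("team1", "team1score", "team2", "team2score"))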

Result:

                 team1 team1score                 team2 team2score
1        Chicago Bears         21  Tampa Bay Buccaneers          9
2     Cleveland Browns         28    Cincinnati Bengals         20
3       Dallas Cowboys         26   Pittsburgh Steelers          9
4        Detroit Lions         31       Atlanta Falcons         28
5    Green Bay Packers         16     Minnesota Vikings         10
6   Indianapolis Colts         45        Houston Oilers         21
7   Kansas City Chiefs         30    New Orleans Saints         17
8     Los Angeles Rams         14     Arizona Cardinals         12
9       Miami Dolphins         39  New England Patriots         35
10     New York Giants         28   Philadelphia Eagles         23
11       New York Jets         23         Buffalo Bills          3
12  San Diego Chargers         37        Denver Broncos         34
13 San Francisco 49ers         44   Los Angeles Raiders         14
14    Seattle Seahawks         28   Washington Redskins          7
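
The score columns in this result come out of strsplit() as text (or as factors under R versions before 4.0, where data.frame() defaulted to stringsAsFactors = TRUE), so they usually need converting before any arithmetic. A minimal base R sketch of that follow-up step is below. The data frame games is a hypothetical two-row stand-in for the result above, and the margin column is only there to show that the conversion worked.

```r
# Hypothetical two-row stand-in for the data frame built above.
# All columns are character, as they would be after strsplit().
games <- data.frame(
  team1      = c("Chicago Bears", "San Francisco 49ers"),
  team1score = c("21", "44"),
  team2      = c(" Tampa Bay Buccaneers", " Los Angeles Raiders"),
  team2score = c("9", "14"),
  stringsAsFactors = FALSE
)

score_cols <- c("team1score", "team2score")
name_cols  <- c("team1", "team2")

# Go through as.character() first so the conversion is also safe for factors.
games[score_cols] <- lapply(games[score_cols],
                            function(x) as.integer(as.character(x)))

# Remove any leading or trailing whitespace left over from splitting.
games[name_cols] <- lapply(games[name_cols],
                           function(x) trimws(as.character(x)))

# With integer scores, arithmetic works as expected.
games$margin <- games$team1score - games$team2score
str(games)
```

Calling as.integer() directly on a factor would return the internal level codes rather than the scores, which is why the sketch converts to character first.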

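Notes:

  1. (?<=\\d)\\s.*--.+$ matches a whitespace character (\\s), followed by any character zero or more times (.*), a literal --, any character one or more times (.+), and the end of the string ($). The pattern carries one extra condition: the whitespace must come directly after a digit ((?<=\\d)).
  2. (?<=...) is called a positive lookbehind. It checks that whatever follows it is immediately preceded by the pattern inside the parentheses.
  3. \\s(?=\\d+\\b) matches a whitespace character that is immediately followed by ((?=...)) one or more digits and a word boundary (\\b). So it matches the space between a team name and that team's score. The \\b keeps it from matching the space before "49ers".
  4. (?=...) is a positive lookahead. It checks that whatever precedes it is immediately followed by the pattern inside the parentheses.
  5. Data: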

    string = c("Chicago Bears 21, Tampa Bay Buccaneers 9 -- Box Score", "Cleveland Browns 28, Cincinnati Bengals 20 -- Box Score", 
           "Dallas Cowboys 26, Pittsburgh Steelers 9 -- Box Score", "Detroit Lions 31, Atlanta Falcons 28 (OT)  -- Box Score", 
           "Green Bay Packers 16, Minnesota Vikings 10 -- Box Score", "Indianapolis Colts 45, Houston Oilers 21 -- Box Score", 
           "Kansas City Chiefs 30, New Orleans Saints 17 -- Box Score", 
           "Los Angeles Rams 14, Arizona Cardinals 12 -- Box Score", "Miami Dolphins 39, New England Patriots 35 -- Box Score", 
           "New York Giants 28, Philadelphia Eagles 23 -- Box Score", "New York Jets 23, Buffalo Bills 3 -- Box Score", 
           "San Diego Chargers 37, Denver Broncos 34 -- Box Score", "San Francisco 49ers 44, Los Angeles Raiders 14 -- Box Score", 
           "Seattle Seahawks 28, Washington Redskins 7 -- Box Score")