Extracting names from text

Time: 2019-03-26 05:40:15

Tags: r regex

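I am trying to extract a list of rugby players' names from a string. The string contains all the information from a table: the headers (team names) and the name of each team's player in every position. It also contains the player ratings (the RPI column), but I don't care about those.

Note that the numbers 1-15 denote positions, and each position number is always followed by two names (the home player and the away player).

Here is the string: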

"Team Sheets     #            LIO Lions      RPI             JAG Jaguares      RPI     1        Dylan Smith     83           Juan Pablo Zeiss     59     2        Malcolm Marx     90           Julian Montoya     73     3        Carlu Sadie     78           Enrique Pieretto Heilan     54     4        Ruan Vermaak     72           Guido Petti Pagadizaval     77     5        Rhyno Herbst     72           Matias Alemanno     67     6        Marnus Schoeman     82           Juan Manuel Leguizamon     58     7        Vincent Tshituka     64           Marcos Kremer     55     8        Kwagga Smith     88           Rodrigo Bruni     62     9        Ross Cronje     74           Martin Landajo     52     10        Elton Jantjies     80           Joaquin Diaz Bonilla     62     11        Courtnall Skosan     76           Emiliano Boffelli     75     12        Franco Naude     52           Bautista Ezcurra     66     13        Wandisile Simelane     73           Matias Moroni     75     14        Sylvian Mahuza     76           Sebastian Cancelliere     65     15        Andries Coetzee     73           Joaquin Tuculet     68      Substitutes      #            LIO Lions      RPI             JAG Jaguares      RPI     16        Pieter Jansen     58           Gaspar Baldunciel     61     17        Nathan McBeth     60           Santiago Garcia Botta     65     18        Frans van Wyk     58           Santiago Medrano     72     19        Stephan Lewies     81           Tomas Lavanini     68     20        James Venter     61           Tomas Lezana     62     21        Dillon Smit     61           Tomas Cubelli     63     22        Harold Vorster     69           Juan Cruz Mallia     66     23        Gianni Lombard     64           Ramiro Moyano     78"

So basically all I want is the list of names, with the team names as headers:

Lions            Jaguares

Dylan Smith      Juan Pablo Zeiss
Malcolm Marx     Julian Montoya
...              ...

Any help would be greatly appreciated!

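2 Answers:

Answer 0 (score: 1)

While I agree with R.S.'s comment about reading the data directly into a data frame, here is my solution using a regular expression: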

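# x holds the string from the question

# build a "player name - RPI" pattern: two or more words, then a 1-2 digit rating
pattern = "[a-zA-Z]+(\\s[a-zA-Z]+)+\\s+\\d{1,2}"

# find all matches in string
m = gregexpr(pattern, x)

# extract all matches from string
plyrs = regmatches(x, m)[[1]]

# drop the trailing RPI so that only the names remain
plyrs = sub("\\s+\\d{1,2}$", "", plyrs)

# build dataframe: matches alternate between home (Lions) and away (Jaguares)
data.frame(lions = plyrs[c(TRUE, FALSE)],
           jaguares = plyrs[c(FALSE, TRUE)],
           stringsAsFactors=FALSE)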
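For readers who prefer Python (the language used in the other answer), the same idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: the shortened `sample` string, the `NAME_THEN_RPI` constant, and the `split_teams` helper are names made up for this example, and the pattern assumes every name consists of at least two plain ASCII words.

```python
import re

# Shortened sample taken from the question's string (header plus the first two rows).
sample = ("Team Sheets     #            LIO Lions      RPI             JAG Jaguares      RPI"
          "     1        Dylan Smith     83           Juan Pablo Zeiss     59"
          "     2        Malcolm Marx     90           Julian Montoya     73")

# Two or more words separated by single whitespace, followed by a 1-2 digit RPI.
# The capture group keeps only the name, so the rating is discarded.
NAME_THEN_RPI = re.compile(r"([A-Za-z]+(?:\s[A-Za-z]+)+)\s+\d{1,2}")

def split_teams(text):
    """Return (home_names, away_names); matches alternate home, away, home, ..."""
    names = NAME_THEN_RPI.findall(text)
    return names[0::2], names[1::2]

lions, jaguares = split_teams(sample)
for home, away in zip(lions, jaguares):
    print(f"{home:<20}{away}")
```

The header cells ("LIO Lions", "JAG Jaguares") are skipped because they are followed by "RPI" or by several spaces rather than by a number, which is what the pattern relies on.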

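Answer 1 (score: 0)

First, you could try building a table structure instead of working with one huge long string. Something like the following (in Python rather than R) might give you a starting point.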

data = 'Team Sheets     #            LIO Lions      RPI             JAG Jaguares      RPI     1        Dylan Smith     83           Juan Pablo Zeiss     59     2        Malcolm Marx     90           Julian Montoya     73     3        Carlu Sadie     78           Enrique Pieretto Heilan     54     4        Ruan Vermaak     72           Guido Petti Pagadizaval     77     5        Rhyno Herbst     72           Matias Alemanno     67     6        Marnus Schoeman     82           Juan Manuel Leguizamon     58     7        Vincent Tshituka     64           Marcos Kremer     55     8        Kwagga Smith     88           Rodrigo Bruni     62     9        Ross Cronje     74           Martin Landajo     52     10        Elton Jantjies     80           Joaquin Diaz Bonilla     62     11        Courtnall Skosan     76           Emiliano Boffelli     75     12        Franco Naude     52           Bautista Ezcurra     66     13        Wandisile Simelane     73           Matias Moroni     75     14        Sylvian Mahuza     76           Sebastian Cancelliere     65     15        Andries Coetzee     73           Joaquin Tuculet     68      Substitutes      #            LIO Lions      RPI             JAG Jaguares      RPI     16        Pieter Jansen     58           Gaspar Baldunciel     61     17        Nathan McBeth     60           Santiago Garcia Botta     65     18        Frans van Wyk     58           Santiago Medrano     72     19        Stephan Lewies     81           Tomas Lavanini     68     20        James Venter     61           Tomas Lezana     62     21        Dillon Smit     61           Tomas Cubelli     63     22        Harold Vorster     69           Juan Cruz Mallia     66     23        Gianni Lombard     64           Ramiro Moyano     78'
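import re
# collapse each run of repeated whitespace into a single character
data = re.sub(r'(\s)\1+', r'\1', data)
# replace the header's trailing "RPI " with a line break before the first row number
data = re.sub(r'RPI\s(\d+)', r'\n\1', data)
# move each "#" header onto a new line
data = re.sub(r'(#)\s', r'\n\1', data)
# break before each following row number (this also drops the away player's RPI)
print(re.sub(r'\d+\s(\d+)', r'\n\1', data))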
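To go from that starting point to the two columns of names the question asks for, one possible continuation is to match a whole table row at a time once the whitespace has been collapsed. The sketch below is an assumption-laden illustration rather than part of the original answer: `ROW` and `parse_rows` are hypothetical names, the `sample` string is a shortened copy of the question's data, and the pattern assumes names contain only ASCII letters and spaces.

```python
import re

# Shortened sample from the question's string; the full string has the same layout.
sample = ("Team Sheets     #            LIO Lions      RPI             JAG Jaguares      RPI"
          "     1        Dylan Smith     83           Juan Pablo Zeiss     59"
          "     2        Malcolm Marx     90           Julian Montoya     73")

# One table row: position, home name, home RPI, away name, away RPI.
# The lazy name groups stop at the first " <digits>" that follows them.
ROW = re.compile(r"(\d+) ([A-Za-z ]+?) (\d+) ([A-Za-z ]+?) (\d+)")

def parse_rows(text):
    """Return a list of (position, home_name, away_name) tuples."""
    flat = re.sub(r"\s+", " ", text)  # collapse whitespace runs first
    return [(int(pos), home, away)
            for pos, home, _home_rpi, away, _away_rpi in ROW.findall(flat)]

rows = parse_rows(sample)
print(f"{'Lions':<20}Jaguares")
for _pos, home, away in rows:
    print(f"{home:<20}{away}")
```

Keeping the position number in each tuple makes it easy to separate the starting fifteen from the substitutes later, if that distinction matters.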