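I am trying to extract a list of rugby players' names from a string. The string contains all the information from a table: the headers (team names) and the name of the player in each position for each team. It also contains the player ratings (RPI), but I don't care about those.
Note that the numbers 1-15 denote positions, and each position is always followed by two names (the home player and the away player).
Here is the string: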
"Team Sheets # LIO Lions RPI JAG Jaguares RPI 1 Dylan Smith 83 Juan Pablo Zeiss 59 2 Malcolm Marx 90 Julian Montoya 73 3 Carlu Sadie 78 Enrique Pieretto Heilan 54 4 Ruan Vermaak 72 Guido Petti Pagadizaval 77 5 Rhyno Herbst 72 Matias Alemanno 67 6 Marnus Schoeman 82 Juan Manuel Leguizamon 58 7 Vincent Tshituka 64 Marcos Kremer 55 8 Kwagga Smith 88 Rodrigo Bruni 62 9 Ross Cronje 74 Martin Landajo 52 10 Elton Jantjies 80 Joaquin Diaz Bonilla 62 11 Courtnall Skosan 76 Emiliano Boffelli 75 12 Franco Naude 52 Bautista Ezcurra 66 13 Wandisile Simelane 73 Matias Moroni 75 14 Sylvian Mahuza 76 Sebastian Cancelliere 65 15 Andries Coetzee 73 Joaquin Tuculet 68 Substitutes # LIO Lions RPI JAG Jaguares RPI 16 Pieter Jansen 58 Gaspar Baldunciel 61 17 Nathan McBeth 60 Santiago Garcia Botta 65 18 Frans van Wyk 58 Santiago Medrano 72 19 Stephan Lewies 81 Tomas Lavanini 68 20 James Venter 61 Tomas Lezana 62 21 Dillon Smit 61 Tomas Cubelli 63 22 Harold Vorster 69 Juan Cruz Mallia 66 23 Gianni Lombard 64 Ramiro Moyano 78"
So basically all I want is a list of the names, with the team names as headers:
Lions Jaguares
Dylan Smith Juan Pablo Zeiss
Malcolm Marx Julian Montoya
... ...
Any help would be greatly appreciated!
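Answer 0 (score: 1)
Although I agree with R.S.'s comment that the data should be read directly into a data frame, here is my solution using regular expressions: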
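# x is assumed to hold the string from the question
# build a "player name - RPI" pattern
pattern = "[a-zA-Z]+(\\s[a-zA-Z]+)+\\s+\\d{1,2}"
# find all matches in string
m = gregexpr(pattern, x)
# extract all matches from string
plyrs = regmatches(x, m)[[1]]
# the header runs ("LIO Lions RPI JAG Jaguares RPI 1") also match the pattern; drop them
plyrs = plyrs[!grepl("RPI", plyrs)]
# strip the trailing RPI value so that only the names remain
plyrs = sub("\\s+\\d{1,2}$", "", plyrs)
# build dataframe: the matches alternate between home (Lions) and away (Jaguares)
data.frame(lions = plyrs[c(TRUE, FALSE)],
           jaguares = plyrs[c(FALSE, TRUE)],
           stringsAsFactors=FALSE)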
Answer 1 (score: 0)
First, you could try creating a table structure rather than one huge long string. Something like this might give you a starting point.
data = 'Team Sheets # LIO Lions RPI JAG Jaguares RPI 1 Dylan Smith 83 Juan Pablo Zeiss 59 2 Malcolm Marx 90 Julian Montoya 73 3 Carlu Sadie 78 Enrique Pieretto Heilan 54 4 Ruan Vermaak 72 Guido Petti Pagadizaval 77 5 Rhyno Herbst 72 Matias Alemanno 67 6 Marnus Schoeman 82 Juan Manuel Leguizamon 58 7 Vincent Tshituka 64 Marcos Kremer 55 8 Kwagga Smith 88 Rodrigo Bruni 62 9 Ross Cronje 74 Martin Landajo 52 10 Elton Jantjies 80 Joaquin Diaz Bonilla 62 11 Courtnall Skosan 76 Emiliano Boffelli 75 12 Franco Naude 52 Bautista Ezcurra 66 13 Wandisile Simelane 73 Matias Moroni 75 14 Sylvian Mahuza 76 Sebastian Cancelliere 65 15 Andries Coetzee 73 Joaquin Tuculet 68 Substitutes # LIO Lions RPI JAG Jaguares RPI 16 Pieter Jansen 58 Gaspar Baldunciel 61 17 Nathan McBeth 60 Santiago Garcia Botta 65 18 Frans van Wyk 58 Santiago Medrano 72 19 Stephan Lewies 81 Tomas Lavanini 68 20 James Venter 61 Tomas Lezana 62 21 Dillon Smit 61 Tomas Cubelli 63 22 Harold Vorster 69 Juan Cruz Mallia 66 23 Gianni Lombard 64 Ramiro Moyano 78'
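import re
# collapse runs of a repeated whitespace character into a single one
data = re.sub(r'(\s)\1{1,}', r'\1', data)
# replace the "RPI" that precedes a position number with a line break
data = re.sub(r'RPI\s(\d+)', r'\n\1', data)
# move each "#" header marker to the start of a new line
data = re.sub(r'(#)\s', r'\n\1', data)
# break before every following position number; this also drops the away player's RPI
print(re.sub(r'\d+\s(\d+)', r'\n\1', data))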
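To go from that starting point to the two-column list the question asks for, one option is to match a whole row at a time: position number, home name, home RPI, away name, away RPI. The sketch below is only an illustration. The helper name `parse_team_sheet` and the shortened sample string are made up for the example, and it assumes single spaces between tokens, one-word team names, and player names made of plain ASCII letters (no hyphens, apostrophes, or accents).

```python
import re

# Shortened sample in the same layout as the string from the question.
data = ("Team Sheets # LIO Lions RPI JAG Jaguares RPI "
        "1 Dylan Smith 83 Juan Pablo Zeiss 59 "
        "2 Malcolm Marx 90 Julian Montoya 73 "
        "Substitutes # LIO Lions RPI JAG Jaguares RPI "
        "16 Pieter Jansen 58 Gaspar Baldunciel 61 "
        "18 Frans van Wyk 58 Santiago Medrano 72")

# One row: position, home name, home RPI, away name, away RPI.
# The lazy name groups stop at the first number that follows them.
ROW = re.compile(r"(\d+) ([A-Za-z ]+?) (\d+) ([A-Za-z ]+?) (\d+)")


def parse_team_sheet(text):
    """Return (home_team, away_team, [(position, home_player, away_player), ...])."""
    # Team names are taken from the first header: "# LIO Lions RPI JAG Jaguares RPI".
    header = re.search(r"# \w+ (\w+) RPI \w+ (\w+) RPI", text)
    home_team, away_team = header.group(1), header.group(2)
    # Keep the position and the two names; the RPI values are discarded.
    rows = [(int(pos), home, away)
            for pos, home, _home_rpi, away, _away_rpi in ROW.findall(text)]
    return home_team, away_team, rows


home_team, away_team, rows = parse_team_sheet(data)
print(f"{home_team:<20}{away_team}")
for _pos, home, away in rows:
    print(f"{home:<20}{away}")
```

Because each match consumes a complete row, the header text between the starting line-up and the substitutes is skipped without any extra filtering. If the real data contains names with other characters, the `[A-Za-z ]` class would need to be widened.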