I have the following code:
import PyPDF2
pdf_file = open("123.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
page_content
I want to extract data from the following content:
page_content
Out[157]: "RiderNatio\nn Motorcycle\nTotal Time\nPosKm/hGap\nTeam \nGRAND PRIX OF QATAR\nResults and timing service provided by\n5380 m.osail International Circ\nuMotoGPŽ\nRaceClassification after 20 laps = 107.6 km\n2925YAMAHA\nMaverick VIÑALES\nSPA138'59.999\n165.5\n25Movistar Yamaha MotoGP\n4DUCATI\nAndrea DOVIZIOSO\nITA239'00.460\n165.50.461\n20Ducati Team\n46YAMAHA\nValentino ROSSI\nITA339'01.927\n165.41.928\n16Movistar Yamaha MotoGP\n93HONDAMarc MARQUEZ\nSPA439'06.744\n165.06.745\n13Repsol Honda Team\n26HONDADani PEDROS\nASPA539'07.127\n165.07.128\n11Repsol Honda Team\n41APRILIA\nAleix ESPARGARO\nSPA639'07.660\n164.97.661\n10Aprilia Racing Team Gresini\n45DUCATI\nScott REDDING\nGBR\n739'09.781\n164.89.782\n9OCTO Pramac Racing\n43HONDAJack MILLERAUS\n839'14.485\n164.514.486\n8EG 0,0 Marc VDS\n42SUZUKI\nAlex RINS\nSPA939'14.787\n164.414.788\n7Team SUZUKI ECSTAR\n94YAMAHA\nJonas FOLGER\nGER\n1039'15.068\n164.415.069\n6Monster Yamaha Tech 3\n99DUCATI\nJorge LORENZO\nSPA1139'20.515\n164.020.516\n5Ducati Team\n76DUCATI\nLoris BAZ\nFRA\n1239'21.254\n164.021.255\n4Reale Avintia Racing\n8DUCATI\nHector BARBER\nASPA1339'28.827\n163.528.828\n3Reale Avintia Racing\n17DUCATI\nKarel ABRAHAM\nCZE\n1439'29.122\n163.529.123\n2Pull&Bear Aspar Team\n53HONDATito RABAT\nSPA1539'29.469\n163.429.470\n1EG 0,0 Marc VDS\n44KTMPol ESPARGARO\nSPA1639'33.600\n163.133.601\nRed Bull KTM Factory Racing\n38KTMBradle\ny SMITH\nGBR\n1739'39.703\n162.739.704\nRed Bull KTM Factory Racing\n22APRILIA\nSam LOWESGBR\n1839'47.130\n162.247.131\nAprilia Racing Team Gresini\nNot Classified\n9DUCATI\nDanilo PETRUCCI\nITA27'31.191\n164.26 laps\nOCTO Pramac Racing\n29SUZUKI\nAndrea IANNONE\nITA19'34.409\n164.910 laps\nTeam SUZUKI ECSTAR\n19DUCATI\nAlvaro BAUTISTA\nSPA13'46.030\n164.113 laps\nPull&Bear Aspar Team\n5YAMAHA\nJohann ZARCO\nFRA\n11'44.661\n164.914 laps\nMonster Yamaha Tech 3\n35HONDACal CRUTCHLOW\nGBR\n8'44.974\n147.516 laps\nLCR HondaDryAir: 21°\nGround: 22°\nHumidity: 96%\nPole 
Position:\nFastest Lap:\nMaverick VIÑALES\n1'54.316\n169.4 Km/h\nJohann ZARCO\n1'55.990\n166.9 Km/h\nLap 4Circuit Record Lap:\nCircuit Best Lap:\nJorge LORENZO\n1'54.927\n168.5 Km/h\nJorge LORENZO\n1'53.927\n170.0 Km/h\n2008\n2016\nRace condition:\nSIGHTING LAP START\n 20:40'00\nSIGHTING LAP START\n 21:15'00\nStart delayed\n 21:21'25WARM UP LAP START\n 21:40'00\nRACE START\n 21:45'16\nNo jump start\n 21:46'06\ncrashed out - Rider OK\nCal CRUTCHLOW\n21:53'13re-joined race\nCal CRUTCHLOW\n21:53'57crashed out - Rider OK\nCal CRUTCHLOW\n21:56'08crashed out - Rider OK\nJohann ZARCO\n21:57'16crashed out - Rider OK\nAlvaro BAUTISTA\n22:00'51crashed out - Rider OK\nAndrea IANNONE\n22:05'29retired\nDanilo PETRUCCI\n22:15'06Time limit for protest expires 30' afte\nr publication of the results - Mr. ...................................................\n...... Time: ...................................\nThe results are provisional until the end of the limit for protest and appeals.\nDoha, Sunday, March 26, 2017\nThese data/results cannot be reproduced, stor\ned and/or transmitted in whole or in part \nby any manner of electronic, mechanical,\n photocopying, recording, broadcasting or otherwise now \nknown or herein after developed without the pr\nevious express consent by \nthe copyright owner, except for reproduction in daily p\nress and regular printed publications on sale to the public \nwithin 60 days of the event related to those data/results and \nalways provided that copyright symbol appears together as follows\n below.\n© DORNA, 2017\nOfficial MotoGP Timing by \nwww.mot\nogp.com\nTISSOT\n"
I want to process it and create a .csv from it, so that I can load it into a data frame and analyze it. I don't know how to clean it up.
I tried:
pgs = page_content.split()
pgs[pgs.index("km")+1:pgs.index("Classified")-1]
Out[183]:
['2925YAMAHA',
'Maverick',
'VIÑALES',
"SPA138'59.999",
'165.5',
'25Movistar',
'Yamaha',
'MotoGP',
'4DUCATI',
'Andrea',
'DOVIZIOSO',
"ITA239'00.460",
'165.50.461',
'20Ducati',
'Team',
'46YAMAHA',
'Valentino',
'ROSSI',
"ITA339'01.927",
'165.41.928',
'16Movistar',
'Yamaha',
'MotoGP',
'93HONDAMarc',
'MARQUEZ',
"SPA439'06.744",
'165.06.745',
'13Repsol',
'Honda',
'Team',
'26HONDADani',
'PEDROS',
"ASPA539'07.127",
'165.07.128',
'11Repsol',
'Honda',
'Team',
'41APRILIA',
'Aleix',
'ESPARGARO',
"SPA639'07.660",
'164.97.661',
'10Aprilia',
'Racing',
'Team',
'Gresini',
'45DUCATI',
'Scott',
'REDDING',
'GBR',
"739'09.781",
'164.89.782',
'9OCTO',
'Pramac',
'Racing',
'43HONDAJack',
'MILLERAUS',
"839'14.485",
'164.514.486',
'8EG',
'0,0',
'Marc',
'VDS',
'42SUZUKI',
'Alex',
'RINS',
"SPA939'14.787",
'164.414.788',
'7Team',
'SUZUKI',
'ECSTAR',
'94YAMAHA',
'Jonas',
'FOLGER',
'GER',
"1039'15.068",
'164.415.069',
'6Monster',
'Yamaha',
'Tech',
'3',
'99DUCATI',
'Jorge',
'LORENZO',
"SPA1139'20.515",
'164.020.516',
'5Ducati',
'Team',
'76DUCATI',
'Loris',
'BAZ',
'FRA',
"1239'21.254",
'164.021.255',
'4Reale',
'Avintia',
'Racing',
'8DUCATI',
'Hector',
'BARBER',
"ASPA1339'28.827",
'163.528.828',
'3Reale',
'Avintia',
'Racing',
'17DUCATI',
'Karel',
'ABRAHAM',
'CZE',
"1439'29.122",
'163.529.123',
'2Pull&Bear',
'Aspar',
'Team',
'53HONDATito',
'RABAT',
"SPA1539'29.469",
'163.429.470',
'1EG',
'0,0',
'Marc',
'VDS',
'44KTMPol',
'ESPARGARO',
"SPA1639'33.600",
'163.133.601',
'Red',
'Bull',
'KTM',
'Factory',
'Racing',
'38KTMBradle',
'y',
'SMITH',
'GBR',
"1739'39.703",
'162.739.704',
'Red',
'Bull',
'KTM',
'Factory',
'Racing',
'22APRILIA',
'Sam',
'LOWESGBR',
"1839'47.130",
'162.247.131',
'Aprilia',
'Racing',
'Team',
'Gresini']
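However, I would still need to split each row starting at the motorcycle brand and convert the result into a data frame. Maybe there is a better approach than the one I am using.
When I extract the data as HTML instead, I get: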
b'<html><head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n</head><body>\n<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>\n<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>\n<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:77px; width:94px; height:11px;"><span style="font-family: b\'ArialMT\'; font-size:11px">osail International Circu\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:150px; top:77px; width:188px; height:14px;"><span style="font-family: b\'ArialMT\'; font-size:14px">Results and timing service provided by\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:149px; top:113px; width:257px; height:55px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:16px">GRAND PRIX OF QATAR\n<br>Race\n<br>Classification after 20 laps = 107.6 km\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:156px; width:32px; height:11px;"><span style="font-family: b\'ArialMT\'; font-size:11px">5380 m.\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:458px; top:89px; width:106px; height:25px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:25px">MotoGP\xe2\x84\xa2\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:541px; top:152px; width:21px; height:20px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:20px">29\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:59px; top:189px; width:19px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Pos\n<br></span></div><div style="position:absolute; border: textbox 1px solid; 
writing-mode:lr-tb; left:112px; top:189px; width:27px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Rider\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:211px; top:189px; width:32px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Nation\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:249px; top:189px; width:30px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Team \n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:364px; top:189px; width:107px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px"> Motorcycle Total Time\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:481px; top:189px; width:26px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Km/h\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:538px; top:189px; width:21px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Gap\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:226px; width:7px; height:174px;"><span style="font-family: b\'ArialMT\'; font-size:9px">25\n<br>20\n<br>16\n<br>13\n<br>11\n<br>10\n<br>9\n<br>8\n<br>7\n<br>6\n<br>5\n<br>4\n<br>3\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">2\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">1\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:64px; top:225px; width:10px; height:213px;"><span style="font-family: b\'Arial-BoldMT\'; 
font-size:12px">1\n<br>2\n<br>3\n<br>4\n<br>5\n<br>6\n<br>7\n<br>8\n<br>9\n<br>10\n<br>11\n<br>12\n<br>13\n<br>14\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">15\n<br>16\n<br>17\n<br>18\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:97px; top:225px; width:10px; height:212px;"><span style="font-family: b\'ArialMT\'; font-size:11px">25\n<br>4\n<br>46\n<br>93\n<br>26\n<br>41\n<br>45\n<br>43\n<br>42\n<br>94\n<br>99\n<br>76\n<br>8\n<br></span><span style="font-family: b\'ArialMT\'; font-size:11px">17\n<br>53\n<br>44\n<br>38\n<br>22\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:112px; top:225px; width:83px; height:213px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">Maverick VI\xc3\x91ALES\n<br>Andrea DOVIZIOSO\n<br>Valentino ROSSI\n<br>Marc MARQUEZ\n<br>Dani PEDROSA\n<br>Aleix ESPARGARO\n<br>Scott REDDING\n<br>Jack MILLER\n<br>Alex RINS\n<br>Jonas FOLGER\n<br>Jorge LORENZO\n<br>Loris BAZ\n<br>Hector BARBERA\n<br>Karel ABRAHAM\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">Tito RABAT\n<br>Pol ESPARGARO\n<br>Bradley SMITH\n<br>Sam LOWES\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:57px; top:440px; width:60px; height:12px;"><span style="font-family: b\'Arial-BoldItalicMT\'; font-size:12px">Not Classified\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:97px; top:452px; width:10px; height:59px;"><span style="font-family: b\'ArialMT\'; font-size:11px">9\n<br>29\n<br>19\n<br>5\n<br>35\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:112px; top:452px; width:76px; height:59px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">Danilo PETRUCCI\n<br>Andrea IANNONE\n<br>Alvaro BAUTISTA\n<br>Johann ZARCO\n<br>Cal 
CRUTCHLOW\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:227px; top:226px; width:17px; height:211px;"><span style="font-family: b\'ArialMT\'; font-size:10px">SPA\n<br>ITA\n<br>ITA\n<br>SPA\n<br>SPA\n<br>SPA\n<br>GBR\n<br>AUS\n<br>SPA\n<br>GER\n<br>SPA\n<br>FRA\n<br>SPA\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">CZE\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">SPA\n<br>SPA\n<br>GBR\n<br>GBR\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:227px; top:452px; width:17px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">ITA\n<br>ITA\n<br>SPA\n<br>FRA\n<br>GBR\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:250px; top:226px; width:105px; height:211px;"><span style="font-family: b\'ArialMT\'; font-size:10px">Movistar Yamaha MotoGP\n<br>Ducati Team\n<br>Movistar Yamaha MotoGP\n<br>Repsol Honda Team\n<br>Repsol Honda Team\n<br>Aprilia Racing Team Gresini\n<br>OCTO Pramac Racing\n<br>EG 0,0 Marc VDS\n<br>Team SUZUKI ECSTAR\n<br>Monster Yamaha Tech 3\n<br>Ducati Team\n<br>Reale Avintia Racing\n<br>Reale Avintia Racing\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">Pull&Bear Aspar Team\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">EG 0,0 Marc VDS\n<br>Red Bull KTM Factory Racing\n<br>Red Bull KTM Factory Racing\n<br>Aprilia Racing Team Gresini\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:250px; top:452px; width:88px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">OCTO Pramac Racing\n<br>Team SUZUKI ECSTAR\n<br>Pull&Bear Aspar Team\n<br>Monster Yamaha Tech 3\n<br>LCR Honda\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:384px; top:226px; width:34px; height:211px;"><span 
style="font-family: b\'ArialMT\'; font-size:10px">YAMAHA\n<br>DUCATI\n<br>YAMAHA\n<br>HONDA\n<br>HONDA\n<br>APRILIA\n<br>DUCATI\n<br>HONDA\n<br>SUZUKI\n<br>YAMAHA\n<br>DUCATI\n<br>DUCATI\n<br>DUCATI\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">DUCATI\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">HONDA\n<br>KTM\n<br>KTM\n<br>APRILIA\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:384px; top:452px; width:33px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">DUCATI\n<br>SUZUKI\n<br>DUCATI\n<br>YAMAHA\n<br>HONDA\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:435px; top:225px; width:35px; height:211px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">38\'59.999\n<br>39\'00.460\n<br>39\'01.927\n<br>39\'06.744\n<br>39\'07.127\n<br>39\'07.660\n<br>39\'09.781\n<br>39\'14.485\n<br>39\'14.787\n<br>39\'15.068\n<br>39\'20.515\n<br>39\'21.254\n<br>39\'28.827\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">39\'29.122\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">39\'29.469\n<br>39\'33.600\n<br>39\'39.703\n<br>39\'47.130\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:435px; top:452px; width:35px; height:58px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">27\'31.191\n<br>19\'34.409\n<br>13\'46.030\n<br>11\'44.661\n<br>8\'44.974\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:492px; top:226px; width:20px; height:211px;"><span style="font-family: b\'ArialMT\'; font-size:10px">165.5\n<br>165.5\n<br>165.4\n<br>165.0\n<br>165.0\n<br>164.9\n<br>164.8\n<br>164.5\n<br>164.4\n<br>164.4\n<br>164.0\n<br>164.0\n<br>163.5\n<br>163.5\n<br>163.4\n<br>163.1\n<br>162.7\n<br>162.2\n<br></span></div><div style="position:absolute; border: textbox 
1px solid; writing-mode:lr-tb; left:492px; top:452px; width:20px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">164.2\n<br>164.9\n<br>164.1\n<br>164.9\n<br>147.5\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:538px; top:237px; width:24px; height:199px;"><span style="font-family: b\'ArialMT\'; font-size:10px">0.461\n<br>1.928\n<br>6.745\n<br>7.128\n<br>7.661\n<br>9.782\n<br>14.486\n<br>14.788\n<br>15.069\n<br>20.516\n<br>21.255\n<br>28.828\n<br>29.123\n<br>29.470\n<br>33.601\n<br>39.704\n<br>47.131\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:537px; top:452px; width:25px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">6 laps\n<br>10 laps\n<br>13 laps\n<br>14 laps\n<br>16 laps\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:68px; top:526px; width:57px; height:10px;"><span style="font-family: b\'Arial-ItalicMT\'; font-size:10px">Race condition:\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:89px; top:528px; width:56px; height:41px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:11px">Dry\n<br></span><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:9px">Air: 21\xc2\xb0\n<br>Humidity: 96%\n<br>Ground: 22\xc2\xb0\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:211px; top:526px; width:70px; height:42px;"><span style="font-family: b\'Arial-ItalicMT\'; font-size:10px">Pole Position:\n<br>Fastest Lap:\n<br>Circuit Record Lap:\n<br>Circuit Best Lap:\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:292px; top:537px; width:20px; height:31px;"><span style="font-family: b\'ArialMT\'; font-size:10px">Lap 4\n<br>2016\n<br>2008\n<br></span></div><div style="position:absolute; 
border: textbox 1px solid; writing-mode:lr-tb; left:287px; top:573px; width:31px; height:136px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">20:40\'00\n<br>21:15\'00\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">21:21\'25\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">21:40\'00\n<br>21:45\'16\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:9px">21:46\'06\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">21:53\'13\n<br>21:53\'57\n<br>21:56\'08\n<br>21:57\'16\n<br>22:00\'51\n<br>22:05\'29\n<br>22:15\'06\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:347px; top:526px; width:71px; height:11px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">Maverick VI\xc3\x91ALES\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:351px; top:536px; width:63px; height:32px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">Johann ZARCO\n<br>Jorge LORENZO\n<br>Jorge LORENZO\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:474px; top:526px; width:30px; height:42px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">1\'54.316\n<br>1\'55.990\n<br>1\'54.927\n<br>1\'53.927\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:517px; top:526px; width:41px; height:41px;"><span style="font-family: b\'ArialMT\'; font-size:10px">169.4 Km/h\n<br>166.9 Km/h\n<br>168.5 Km/h\n<br>170.0 Km/h\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:324px; top:573px; width:57px; height:136px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px"> \n<br> \n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px"> \n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px"> \n<br> \n<br></span><span 
style="font-family: b\'Arial-BoldMT\'; font-size:9px"> \n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">Cal CRUTCHLOW\n<br>Cal CRUTCHLOW\n<br>Cal CRUTCHLOW\n<br>Johann ZARCO\n<br>Alvaro BAUTISTA\n<br>Andrea IANNONE\n<br>Danilo PETRUCCI\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:447px; top:573px; width:85px; height:136px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">SIGHTING LAP START\n<br>SIGHTING LAP START\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">Start '
Answer 0 (score: 1)
Once I had the HTML, I cleaned it up with:
import lxml.html.clean as lhc
and
from bs4 import BeautifulSoup as bs
# motoh holds the HTML extracted from the PDF
motobs = bs(motoh, "html.parser")
# strip all tags and keep only the text
motobsg = motobs.get_text()
mbs = str(motobsg)
mbss = mbs.split()
From there I had to write a function that works out how these tokens relate to each other, so that I could build a data frame:
mbsd
Out[216]:
['1',
'2',
'3',
'4',
'5',
'6',
'7',
'8',
'9',
'10',
'11',
'12',
'13',
'14',
'15',
'16',
'25',
'46',
'35',
'19',
'5',
'94',
'9',
'45',
'43',
'17',
'76',
'53',
'8',
'44',
'38',
'29',
'Maverick',
'VIÑALES',
'Valentino',
'ROSSI',
'Cal',
'CRUTCHLOW',
'Alvaro',
'BAUTISTA',
'Johann',
'ZARCO',
'Jonas',
'FOLGER',
'Danilo',
'PETRUCCI',
'Scott',
'REDDING',
'Jack',
'MILLER',
'Karel',
'ABRAHAM',
'Loris',
'BAZ',
'Tito',
'RABAT',
'Hector',
'BARBERA',
'Pol',
'ESPARGARO',
'Bradley',
'SMITH',
'Andrea',
'IANNONE',
'Not',
'Classified',
'4',
'41',
'26',
'22',
'42',
'93',
'99',
'Andrea',
'DOVIZIOSO',
'Aleix',
'ESPARGARO',
'Dani',
'PEDROSA',
'Sam',
'LOWES',
'Alex',
'RINS',
'Marc',
'MARQUEZ',
'Jorge',
'LORENZO',
'SPA',
'ITA',
'GBR',
'SPA',
'FRA',
'GER',
'ITA',
'GBR',
'AUS',
'CZE',
'FRA',
'SPA',
'SPA',
'SPA',
'GBR',
'ITA',
'Movistar',
'Yamaha',
'MotoGP',
'Movistar',
'Yamaha',
'MotoGP',
'LCR',
'Honda',
'Pull&Bear',
'Aspar',
'Team',
'Monster',
'Yamaha',
'Tech',
'3',
'Monster',
'Yamaha',
'Tech',
'3',
'OCTO',
'Pramac',
'Racing',
'OCTO',
'Pramac',
'Racing',
'EG',
'0,0',
'Marc',
'VDS',
'Pull&Bear',
'Aspar',
'Team',
'Reale',
'Avintia',
'Racing',
'EG',
'0,0',
'Marc',
'VDS',
'Reale',
'Avintia',
'Racing',
'Red',
'Bull',
'KTM',
'Factory',
'Racing',
'Red',
'Bull',
'KTM',
'Factory',
'Racing',
'Team',
'SUZUKI',
'ECSTAR',
'Ducati',
'Team',
'Aprilia',
'Racing',
'Team',
'Gresini',
'Repsol',
'Honda',
'Team',
'Aprilia',
'Racing',
'Team',
'Gresini',
'Team',
'SUZUKI',
'ECSTAR',
'Repsol',
'Honda',
'Team',
'ITA',
'SPA',
'SPA',
'GBR',
'SPA',
'SPA',
'SPA',
'Ducati',
'Team',
'YAMAHA',
'YAMAHA',
'HONDA',
'DUCATI',
'YAMAHA',
'YAMAHA',
'DUCATI',
'DUCATI',
'HONDA',
'DUCATI',
'DUCATI',
'HONDA',
'DUCATI',
'KTM',
'KTM',
'SUZUKI',
'DUCATI',
'APRILIA',
'HONDA',
'APRILIA',
'SUZUKI',
'HONDA',
'DUCATI',
"41'45.060",
"41'47.975",
"41'48.814",
"41'51.583",
"42'00.564",
"42'03.301",
"42'05.106",
"42'10.540",
"42'10.725",
"42'11.463",
"42'12.012",
"42'26.935",
"42'27.830",
"42'28.145",
"42'28.512",
"42'31.279",
"23'31.497",
"23'31.661",
"21'48.977",
"18'51.906",
"19'14.623",
"5'02.050",
'172.6',
'172.4',
'172.4',
'172.2',
'171.6',
'171.4',
'171.2',
'170.9',
'170.9',
'170.8',
'170.8',
'169.8',
'169.7',
'169.7',
'169.7',
'169.5',
'171.6',
'171.5',
'171.8',
'168.1',
'164.8',
'171.8',
'2.915',
'3.754',
'6.523',
'15.504',
'18.241',
'20.046',
'25.480',
'25.665',
'26.403',
'26.952',
'41.875',
'42.770',
'43.085',
'43.452',
'46.219',
'11',
'laps',
'11',
'laps',
'12',
'laps',
'14',
'laps',
'14',
'laps',
'22',
'laps',
'Race',
'condition:',
'Dry',
'Air:',
'20°',
'Humidity:',
'60%',
'Ground:',
'25°']
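The function itself is not shown here. The following is a minimal sketch of one way to relate the values, not the code actually used for this answer. It assumes the HTML has the shape shown in the question: one absolutely positioned `<div>` per table column, with cells separated by `<br>`. Instead of flattening everything with `split()`, it keeps each `<div>` as a column, sorts the columns by their `left` offset, and zips them into rows. The names `SAMPLE_HTML`, `extract_columns`, `build_rows`, and `rows_to_csv` are hypothetical, the sample is a shortened copy of a few columns from the question, and only the standard library (`re`, `csv`) is used rather than BeautifulSoup.

```python
import csv
import io
import re
from itertools import zip_longest

# Shortened sample in the same shape as the HTML in the question (assumption:
# the real file follows this pattern for every table column).
SAMPLE_HTML = (
    '<div style="position:absolute; left:59px; top:189px;"><span>Pos\n<br></span></div>'
    '<div style="position:absolute; left:64px; top:225px;"><span>1\n<br>2\n<br></span>'
    '<span>3\n<br></span></div>'
    '<div style="position:absolute; left:112px; top:225px;"><span>Maverick VIÑALES\n<br>'
    'Andrea DOVIZIOSO\n<br>Valentino ROSSI\n<br></span></div>'
    '<div style="position:absolute; left:227px; top:226px;"><span>SPA\n<br>ITA\n<br>ITA\n<br>'
    '</span></div>'
    '<div style="position:absolute; left:384px; top:226px;"><span>YAMAHA\n<br>DUCATI\n<br>'
    'YAMAHA\n<br></span></div>'
    '<div style="position:absolute; left:435px; top:225px;"><span>38\'59.999\n<br>'
    '39\'00.460\n<br>39\'01.927\n<br></span></div>'
)

DIV_RE = re.compile(r'<div style="([^"]*)">(.*?)</div>', re.S)
LEFT_RE = re.compile(r'left:(\d+)px')
TOP_RE = re.compile(r'top:(\d+)px')
TAG_RE = re.compile(r'<[^>]+>')
BR_RE = re.compile(r'<br\s*/?>')


def extract_columns(html):
    """Return (left, top, cells) for every <div> that has a left and top offset."""
    columns = []
    for style, body in DIV_RE.findall(html):
        left = LEFT_RE.search(style)
        top = TOP_RE.search(style)
        if not (left and top):
            continue
        # Cells are separated by <br>; remove leftover tags and empty cells.
        cells = [TAG_RE.sub('', part).strip() for part in BR_RE.split(body)]
        cells = [cell for cell in cells if cell]
        if cells:
            columns.append((int(left.group(1)), int(top.group(1)), cells))
    return columns


def build_rows(columns, top_min, top_max):
    """Keep columns whose top offset is in the band, order them left to right, zip to rows."""
    band = sorted(col for col in columns if top_min <= col[1] <= top_max)
    return [list(row) for row in zip_longest(*(col[2] for col in band), fillvalue='')]


def rows_to_csv(header, rows):
    """Serialize the header and rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


columns = extract_columns(SAMPLE_HTML)
# The classified riders start around top:225px in the question's HTML.
rows = build_rows(columns, top_min=220, top_max=230)
csv_text = rows_to_csv(['Pos', 'Rider', 'Nation', 'Motorcycle', 'Total Time'], rows)
print(csv_text)
```

The resulting CSV text can be written to a file and loaded with `pandas.read_csv`. One caveat: `zip_longest` only pads at the end of a short column, so a column that starts lower on the page (such as Gap, which has no value for the winner) would need empty cells prepended before zipping. The "Not Classified" block sits in a separate band (around `top:452px` in the question) and can be extracted with a second `build_rows` call.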