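I want to extract some values from a nested table in HTML. I tried BeautifulSoup, but without success. I would appreciate any help.
The HTML looks like this: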
<table style="border-width:0px;width:100%;">
<tr valign="middle">
<td style="width:400px;"><span><span style='font-size: 12px;'>Race 1</span><br /><br /></span><span><span style='font-size: 12px;'><strong>Grade:</strong> M 400 metres</span>
<br /></span>
<span><span style='font-size: 12px;'><strong>Prize Money:</strong> $1180</span> $825 - $235 - $120<br /><br /></span>
<table>
<tr valign="middle">
<td style="width:105px;"><span>Race Time:</span></td><td align="left" style="width:50px;"><span>(8.44)</span></td><td align="left" style="width:50px;"><span>(0.00)</span></td><td align="left" style="width:50px;"><span>(22.95)</span></td><td></td>
</tr><tr valign="middle">
<td style="width:105px;"><span>Sectional Time:</span></td><td align="left" style="width:50px;"><span>8.44</span></td><td align="left" style="width:50px;"><span>0.00</span></td><td align="left" style="width:50px;"><span>14.51</span></td><td></td>
</tr><tr valign="middle">
<td style="width:150px;"><span>1<sup>st</sup> In-Running Position:</span></td><td colspan="4"><span><img src='/Images/BoxNumber1_s.gif' width='20px' alt='1' /> <img src='/Images/BoxNumber5_s.gif' width='20px' alt='5' /> <img src='/Images/BoxNumber2_s.gif' width='20px' alt='2' /> <img src='/Images/BoxNumber4_s.gif' width='20px' alt='4' /> <img src='/Images/BoxNumber7_s.gif' width='20px' alt='7' /> </span></td>
</tr><tr valign="middle">
<td><span>2<sup>nd</sup> In-Running Position:</span></td><td colspan="4"><span><img src='/Images/BoxNumber1_s.gif' width='20px' alt='1' /> <img src='/Images/BoxNumber5_s.gif' width='20px' alt='5' /> <img src='/Images/BoxNumber2_s.gif' width='20px' alt='2' /> <img src='/Images/BoxNumber7_s.gif' width='20px' alt='7' /> <img src='/Images/BoxNumber4_s.gif' width='20px' alt='4' /> </span></td>
</tr>
</table> </td>
<td class="ResultsPageRightColumn" valign="bottom"></td>
</tr>
</table>
My expected result looks like this:
Thank you very much!
Answer 0 (score: 0)
I came up with the code below. It works, though it is probably not the most elegant solution.
import re

import numpy as np
import pandas as pd


def getRaceInfo(rawString):
    # First attempt: a header that starts with "Race" and ends with digits followed by <br />.
    pattern = re.compile(r"(<span style='font-size: 12px;'>Race).*?\d+(<br />)", re.IGNORECASE)
    spanlist = [m.span() for m in pattern.finditer(rawString)]
    raceI = []
    # Applied in order: strip the markup and turn field boundaries into '|'.
    replacements = [("<span style='font-size: 12px;'>", ''),
                    (' ', '|'), (' ', ''),
                    ('<span>', ''), ('<br />', ''), ('<strong>', ''), ('</strong>', ''),
                    ('</span>', '|'), ('||', '|')
                    ]
    for (i, j) in spanlist:
        nString = rawString[i:j]
        for t, r in replacements:
            nString = nString.replace(t, r)
        raceI.append(nString)
    if len(raceI) != 0:
        # One '|'-delimited string per race; split it into columns.
        df = pd.DataFrame(raceI, columns=['temp'])
        df[['RACE_NUM', 'Race_Grade', 'Race_Distance', 'Race_PrizeMoney1', 'Race_PrizeMoney2']] = df['temp'].str.split('|', expand=True)
        df['Race_Other'] = np.nan
        try:
            df[['RACE_NUM', 'RACE_Detail']] = df['RACE_NUM'].str.split(' :: ', expand=True)
        except Exception:
            # The split did not yield a usable detail column.
            df['RACE_Detail'] = np.nan
        df = df.drop(['temp'], axis=1)
    else:
        # Fallback: a header that runs up to the opening of the nested <table>.
        pattern = re.compile(r"(font-size: 12px;\'>Race).*?(</span><br /><br /></span><table>)", re.IGNORECASE)
        spanlist = [m.span() for m in pattern.finditer(rawString)]
        raceI = []
        replacements = [("font-size: 12px;\'>", ''), ("<span style='", ''), ('<table>', ''),
                        (' ', '|'), (' ', ''),
                        ('<span>', ''), ('<br />', ''), ('<strong>', ''), ('</strong>', ''),
                        ('</span>', '|'), ('||', '|'), ('><span style=\'font-size: 12px;\'>', '|')
                        ]
        for (i, j) in spanlist:
            nString = rawString[i:j]
            for t, r in replacements:
                nString = nString.replace(t, r)
            raceI.append(nString)
        if len(raceI) != 0:
            df = pd.DataFrame(raceI, columns=['temp'])
            df[['RACE_NUM', 'Race_Grade', 'Race_Distance', 'Race_Other', 'Race_PrizeMoney1']] = df['temp'].str.split('|', expand=True)
            df['Race_PrizeMoney2'] = np.nan
            try:
                df[['RACE_NUM', 'RACE_Detail']] = df['RACE_NUM'].str.split(' :: ', expand=True)
            except Exception:
                df['RACE_Detail'] = np.nan
            df = df.drop(['temp'], axis=1)
        else:
            # Neither pattern matched: return an empty frame with the expected columns.
            df = pd.DataFrame(columns=['RACE_NUM', 'Race_Grade', 'Race_Distance', 'Race_Other', 'RACE_Detail', 'Race_PrizeMoney1', 'Race_PrizeMoney2'])
    return df
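The function above only covers the race header. For the nested table itself (race time, sectional time, in-running positions), walking the markup with a parser may be less brittle than regular expressions. Below is a minimal sketch using the standard library's `html.parser` rather than BeautifulSoup. The names `TableRowParser` and `inner_table_fields` are made up for illustration, and the sample is a shortened version of the HTML in the question. It assumes the values of interest sit in a table nested exactly one level deep and that the box numbers are only available as the `alt` text of the images.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell texts of every <tr>, along with its table nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <table> elements currently open
        self.row_stack = []   # rows being built, innermost last
        self.cell_stack = []  # cells being built, innermost last
        self.rows = []        # finished rows as (depth, [cell_text, ...])

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "tr":
            self.row_stack.append([])
        elif tag == "td":
            self.cell_stack.append([])
        elif tag == "img" and self.cell_stack:
            # Box numbers appear only as the alt text of the images.
            self.cell_stack[-1].append(dict(attrs).get("alt") or "")

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag == "td" and self.cell_stack:
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join("".join(self.cell_stack.pop()).split())
            if self.row_stack:
                self.row_stack[-1].append(text)
        elif tag == "tr" and self.row_stack:
            self.rows.append((self.depth, self.row_stack.pop()))

    def handle_data(self, data):
        if self.cell_stack:
            self.cell_stack[-1].append(data)


def inner_table_fields(html, depth=2):
    """Map each row label of the tables at the given depth to its non-empty values."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    fields = {}
    for row_depth, cells in parser.rows:
        if row_depth != depth or not cells:
            continue
        fields[cells[0].rstrip(":")] = [c for c in cells[1:] if c]
    return fields


# Shortened version of the HTML from the question.
sample = """
<table style="border-width:0px;width:100%;">
<tr valign="middle">
<td style="width:400px;"><span>Race 1</span>
<table>
<tr valign="middle">
<td><span>Race Time:</span></td><td><span>(8.44)</span></td><td><span>(0.00)</span></td><td><span>(22.95)</span></td><td></td>
</tr><tr valign="middle">
<td><span>1<sup>st</sup> In-Running Position:</span></td><td colspan="4"><span><img src='/Images/BoxNumber1_s.gif' alt='1' /> <img src='/Images/BoxNumber5_s.gif' alt='5' /> </span></td>
</tr>
</table> </td>
<td class="ResultsPageRightColumn" valign="bottom"></td>
</tr>
</table>
"""

fields = inner_table_fields(sample)
print(fields)
```

Each label maps to a list of strings, so the result can be passed to `pd.DataFrame([fields])` or merged with the header columns from `getRaceInfo` if a single row per race is wanted.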