我正在使用PHP和libtidy来尝试筛选可能是历史上最糟糕和最不正确的HTML表格使用情况。该网站关闭了几个table,tr,td,font或粗体标签,并在表格中不断嵌套许多不同的表格层。
示例摘录:
<center>
<table border="1" bordercolor="#000000" cellspacing="0" cellpadding="0">
<tr>
<td width="50%">
<center>
Home Team - <b>Wildcats<td>
<center>
Away Team - <b>Polar Bears<tr>
<td colspan="2">
<center>
<b><font size="+1">Rosters<tr>
<td valign="top">
<center>
<table border="0" cellspacing="0">
<tr>
<td>
<font size="2">1 <td>
<font size="2">Baird, T<tr>
<td>
<font size="2">2 <td>
<font size="2">Knight, P<tr>
<td>
<font size="2">8 <td>
<font size="2">Miller, B<tr>
<td>
<font size="2">9 <td>
<font size="2">Huebsch, B<tr>
<td>
<font size="2">11 <td>
<font size="2">Buschmann, C<tr>
<td>
<font size="2">12 <td>
<font size="2">Reding, J<tr>
<td>
<font size="2">14 <td>
<font size="2">Simpson, S<tr>
<td>
<font size="2">27 <td>
<font size="2">Kupferschmidt, M<tr>
<td>
<font size="2">28 <td>
<font size="2">Anderson, D<tr>
<td>
<font size="2">31 <td>
<font size="2">Gehrts, J<tr>
<td>
<font size="2">39 <td>
<font size="2">McGinnis, G<tr>
<td>
<font size="2">42 <td>
<font size="2">Temple, B<tr>
<td>
<font size="2">44 <td>
<font size="2">Kemplin, A<tr>
<td>
<font size="2">77 <td>
<font size="2">Weiner, B<tr>
<td>
<font size="2">95 <td>
<font size="2">
Zytkoskie, D</table>
<td valign="top">
<center>
<table border="0" cellspacing="0">
<tr>
<td>
<font size="2">5 <td>
<font size="2">Mack, A<tr>
<td>
<font size="2">8 <td>
<font size="2">Foucault, R<tr>
<td>
<font size="2">11 <td>
<font size="2">Oberpriller, D *<tr>
<td>
<font size="2">12 <td>
<font size="2">Underwood, J<tr>
<td>
<font size="2">15 <td>
<font size="2">Oberpriller, M<tr>
<td>
<font size="2">19 <td>
<font size="2">Langfus, B<tr>
<td>
<font size="2">25 <td>
<font size="2">Carroll, R<tr>
<td>
<font size="2">30 <td>
<font size="2">Hirdler, T<tr>
<td>
<font size="2">33 <td>
<font size="2">Gibson, S<tr>
<td>
<font size="2">35 <td>
<font size="2">Marthaler, C<tr>
<td>
<font size="2">44 <td>
<font size="2">Yurik, J<tr>
<td>
<font size="2">58 <td>
<font size="2">
Gronemeyer, S</table>
<tr>
<td colspan="2">
<center>
<b><font size="+1">Goals<tr>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<b><font size="2">Period<td>
<b><font size="2">Time<td>
<b><font size="2">Assist 1<td>
<b><font size="2">Assist 2<td>
<b><font size="2">SH<td>
<b><font size="2">PP<tr>
<td nowrap>
<font size="2">Kupferschmidt, M<td>
<font size="2">1<td>
<font size="2">12:51<td nowrap>
<font size="2">Kemplin, A<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">McGinnis, G<td>
<font size="2">1<td>
<font size="2">12:33<td nowrap>
<font size="2">Huebsch, B<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">Kupferschmidt, M<td>
<font size="2">2<td>
<font size="2">16:01<td nowrap>
<font size="2">None<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">Buschmann, C<td>
<font size="2">3<td>
<font size="2">00:38<td nowrap>
<font size="2">None<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
</table>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<b><font size="2">Period<td>
<b><font size="2">Time<td>
<b><font size="2">Assist 1<td>
<b><font size="2">Assist 2<td>
<b><font size="2">SH<td>
<b><font size="2">PP<tr>
<td nowrap>
<font size="2">Oberpriller, D *<td>
<font size="2">3<td>
<font size="2">12:31<td nowrap>
<font size="2">Gronemeyer, S<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
</table>
<tr>
<td colspan="2">
<center>
<b><font size="+1">Penalties<tr>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<font size="2"><b>Period<td>
<font size="2"><b>Minutes<td>
<font size="2"><b>Offense<td>
<font size="2"><b>Start<td>
<font size="2"><b>Expired<tr>
<td nowrap>
<font size="2">Buschmann, C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">11:11<td>
<font size="2">09:11<tr>
<td nowrap>
<font size="2">Buschmann, C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Unsportmanlike Conduct<td>
<font size="2">03:26<td>
<font size="2">01:26<tr>
<td nowrap>
<font size="2">Bench<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Too Many Men<td>
<font size="2">01:46<td>
<font size="2">
00:00</table>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<font size="2"><b>Period<td>
<font size="2"><b>Minutes<td>
<font size="2"><b>Offense<td>
<font size="2"><b>Start<td>
<font size="2"><b>Expired<tr>
<td nowrap>
<font size="2">Marthaler, C<td>
<font size="2">
<center>
1<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">01:19<td>
<font size="2">16:19<tr>
<td nowrap>
<font size="2">Underwood, J<td>
<font size="2">
<center>
2<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">12:32<td>
<font size="2">10:32<tr>
<td nowrap>
<font size="2">Marthaler, C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">11:39<td>
<font size="2">
09:39</table>
<tr>
<td colspan="2">
<center>
<font size="+1"><b>Goalies<tr>
<td>
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Name<td>
<font size="2"><b>Shots<td>
<font size="2"><b>Goals<tr>
<td>
<font size="2">Baird, T<td>
<font size="2">20<td>
<font size="2">1<tr>
<td>
<font size="2"><b>Open Net<td>
<td>
<font size="2">
0</table>
<td>
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Name<td>
<font size="2"><b>Shots<td>
<font size="2"><b>Goals<tr>
<td>
<font size="2">Hirdler, T<td>
<font size="2">42<td>
<font size="2">
奇怪的是,所有浏览器似乎都很好。 PHPTidy设法很好地理解了这一切,但是这些表是如此深入且几乎随机地嵌套,使用DOM XPath很难遍历它。
是否有人对其他方法有任何建议?
POST-MORTEM :经过太多的比利时小麦啤酒并弄脏了我的代码真的很好我通过strip_tags()删除所有标签,除了table,tr和td,然后运行它通过libtidy。它现在格式很漂亮,很容易遍历。看起来它只需要一点点按摩然后再发送到解析器。
答案 0 :(得分:3)
您可以使用一些技巧来清理表格等高度可预测的结构。在运行HTML整理之前,您可以使用正则表达式或其他内容来搜索<tr>
和<td>
,然后是另一个<tr>
或<td>
,并插入相应的紧接着它。在<td>
内部容纳表格有一些额外的技巧,但没有什么是不可能处理的。首先找到最里面的结构并从那里向外移动。
真正的难题是未闭合的<div>
和<p>
之类的东西,它们可能更难以与相应的(或缺少的)闭合器匹配。
答案 1 :(得分:2)
如果您对Python等其他语言持开放态度,Beautiful Soup非常适合重建编写糟糕的HTML。我刚刚尝试通过以下代码片段运行您的HTML,现在它非常易读。
#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup
html = "long string of html"
soup = BeautifulSoup(html)
print soup.prettify()
答案 2 :(得分:2)
如果您正在寻找数据,我只需删除所有html 并将其作为逐行原始输入处理。您可以使用strip_tags功能。
$clean = strip_tags($input);
// example: <p>Test paragraph.</p> <a href="#fragment">Other text</a>
// returns: Test paragraph. Other text
答案 3 :(得分:0)
使用正则表达式而不是将其解析为XML,可能会更好地抓取所需的结果。
答案 4 :(得分:0)
我在Python的lxml库中使用了xpath来解析IMDB Top 250页面。 View the source让自己看看它有多糟糕。
以下代码解析已保存的IMDB前250页(top250.html
)并将提取的信息存储在sqlite数据库中(top250.db
)
import sqlite3
from lxml import html
tree = html.parse('top250.html')
class TopMovie(object):
base_xpath = "/html/body/div/div[2]/layer/div[3]/table/tr/td[3]/div/table/tr/td/table/tr[%d]"
def __init__(self, num):
self.rank = num
self.xpath = self.base_xpath % (self.rank + 1)
def rating(self):
return tree.xpath(self.xpath + '/td[2]/font')[0].text
def link(self):
return tree.xpath(self.xpath + '/td[3]/font/a')[0].values()[0]
def title(self):
return tree.xpath(self.xpath + '/td[3]/font')[0].text_content()
def votes(self):
return tree.xpath(self.xpath + '/td[4]/font')[0].text
def main():
conn = sqlite3.connect('top250.db')
conn.execute("""DROP TABLE IF EXISTS movies""")
conn.execute("""
CREATE TABLE movies (
id INTEGER PRIMARY KEY,
title TEXT,
link TEXT,
rating TEXT,
votes INTEGER
)""")
for n in xrange(1, 251):
m = TopMovie(n)
query = r'INSERT INTO movies VALUES (%d, "%s", "%s", "%s", "%s")' \
% (n, m.title(), m.link(), m.rating(), m.votes().replace(',', ''))
conn.execute(query)
conn.commit()
conn.close()
if __name__ == "__main__":
main()