Question

I'm trying to parse a table that looks like this:

<table>
    <tr> <th> header1 </th> <th> header2 </th> </tr>
    <th> missing1 </th> <th> missing2 </th>
    <tr> <td> data1 </td> <td> data2 </td> </tr>
</table>

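I specifically need to access the row containing "missing". Is there a way to access that row? The table renders fine in a browser, so I expected BeautifulSoup to find it, but b.findAll('tr') misses it.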

Edit: a specific, more complicated example: http://atlasgal.mpifr-bonn.mpg.de/cgi-bin/ATLASGAL_SEARCH_RESULTS.cgi?text_field_1=AGAL010.472%2B00.027&catalogue_field=Sextractor&gc_flag= In particular, see the table with "line transitions", which has several columns.

An example of the specific problem:

import requests
from bs4 import BeautifulSoup
r = BeautifulSoup(requests.get('http://atlasgal.mpifr-bonn.mpg.de/cgi-bin/ATLASGAL_SEARCH_RESULTS.cgi?text_field_1=AGAL010.472%2B00.027&catalogue_field=Sextractor&gc_flag=').content)
table = r.select('table:nth-of-type(5) tr')

table is missing this row, which is present in the source: r.select('table tr')[19]

Answer 1

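It depends on how the parser handles it. The HTML is broken, and although HTML parsers do their best to represent the data anyway, they do not all repair it the same way. Older HTML standards left error handling undefined; the HTML5 specification does define it, and browsers and html5lib follow that algorithm, while other parsers apply their own rules.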

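BeautifulSoup can use different parsers. The standard-library parser, html.parser, is always available; if lxml is installed and you do not name a parser, BeautifulSoup uses lxml instead. You can also use the external html5lib module: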

>>> from bs4 import BeautifulSoup
>>> broken = '''\
... <table>
...     <tr> <th> header1 </th> <th> header2 </th> </tr>
...     <th> missing1 </th> <th> missing2 </th>
...     <tr> <td> data1 </td> <td> data2 </td> </tr>
... </table>
... '''
>>> BeautifulSoup(broken, 'html.parser').select('table tr')
[<tr> <th> header1 </th> <th> header2 </th> </tr>, <tr> <td> data1 </td> <td> data2 </td> </tr>]
>>> BeautifulSoup(broken, 'lxml').select('table tr')
[<tr> <th> header1 </th> <th> header2 </th> </tr>, <tr> <td> data1 </td> <td> data2 </td> </tr>]
>>> BeautifulSoup(broken, 'html5lib').select('table tr')
[<tr> <th> header1 </th> <th> header2 </th> </tr>, <tr><th> missing1 </th> <th> missing2 </th>
    </tr>, <tr> <td> data1 </td> <td> data2 </td> </tr>]

As you can see, the html5lib parser keeps the row with the "missing" text in the tree:

>>> BeautifulSoup(broken, 'html5lib').select('table tr:nth-of-type(2)')
[<tr><th> missing1 </th> <th> missing2 </th>
    </tr>]
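
To check which of the parsers installed on your own machine keep the row, a small comparison can help. This is only a sketch: it assumes bs4 is installed, the helper name count_rows is made up for illustration, and parsers that are not installed are skipped by catching bs4's FeatureNotFound exception.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = '''\
<table>
    <tr> <th> header1 </th> <th> header2 </th> </tr>
    <th> missing1 </th> <th> missing2 </th>
    <tr> <td> data1 </td> <td> data2 </td> </tr>
</table>
'''


def count_rows(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Return {parser name: number of <tr> elements}. Hypothetical helper."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser library is not installed here; skip it.
            continue
        counts[name] = len(soup.find_all('tr'))
    return counts


counts = count_rows(broken)
for name, n in counts.items():
    print(name, n)
```

A parser that reports one more row than the others is the one that wrapped the stray header cells in a row of their own.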

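If you need to find a specific table by its heading, you could search for the heading text first and then navigate up to the parent table: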

import requests
from bs4 import BeautifulSoup

url = 'http://atlasgal.mpifr-bonn.mpg.de/cgi-bin/ATLASGAL_SEARCH_RESULTS.cgi?text_field_1=AGAL010.472%2B00.027&catalogue_field=Sextractor&gc_flag='
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html5lib')

# string= is the current name for the older text= argument
table = soup.find(string='Fitted Parameters for Observed Molecular Transitions').find_parent('table')
for row in table.find_all('tr'):
    print(row)
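
If installing html5lib is not an option, the same repair can be approximated with the standard library alone. The following is a rough sketch, not what html5lib actually does internally: it assumes simple, non-nested tables, and the class name RowCollector is hypothetical. It opens an implicit row whenever a th or td cell appears outside any tr.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect table rows as lists of cell text (hypothetical helper).

    Assumes flat, non-nested tables. A <th>/<td> found outside any <tr>
    starts an implicit row.
    """

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per row
        self.in_row = False   # True while a row (explicit or implicit) is open
        self.in_cell = False  # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
            self.in_row = True
        elif tag in ('th', 'td'):
            if not self.in_row:
                # Stray cell with no enclosing <tr>: open an implicit row.
                self.rows.append([])
                self.in_row = True
            self.rows[-1].append('')
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self.in_cell = False
        elif tag in ('tr', 'table'):
            self.in_row = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


broken = '''\
<table>
    <tr> <th> header1 </th> <th> header2 </th> </tr>
    <th> missing1 </th> <th> missing2 </th>
    <tr> <td> data1 </td> <td> data2 </td> </tr>
</table>
'''

collector = RowCollector()
collector.feed(broken)
rows = [[cell.strip() for cell in row] for row in collector.rows]
print(rows)
```

For real pages with nested tables or other markup errors, html5lib remains the safer choice, since this sketch only handles the one defect shown in the question.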
