Question

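I'm trying to scrape this site: http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw and get the table at the bottom. When I scrape it, I get a few elements from the first row but nothing from the rest of the table. Here is my code: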

from urllib.request import urlopen
import bs4 as bs

urlText = "http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw"
url = urlopen(urlText)
soup = bs.BeautifulSoup(url, "html.parser")
# Locate the results table by its CSS classes
certificates = soup.find('table', class_='table table-bordered')
for row in certificates.find_all('tr'):
    for td in row.find_all('td'):
        print(td.text)

The output I get is:

22-20353

                                SHIP SECURITY OFFICER

rather than the whole table. What am I missing?

Answer 1

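This is yet another case where the underlying parser makes a difference. The page's markup is presumably not well-formed, and html.parser recovers from that differently than the more lenient parsers do. Switch to lxml or html5lib to see the complete table parsed: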

# Use one of the following (each parser is a separate install):
soup = bs.BeautifulSoup(url, "lxml")
# or
soup = bs.BeautifulSoup(url, "html5lib")
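
To see the parser difference without hitting the site, here is a minimal sketch that feeds the same malformed markup to each parser and prints the resulting tree. The snippet in BROKEN_HTML and the helper parse_with_each are made up for illustration; they are not taken from the page in the question, and the exact trees you get for lxml and html5lib depend on the versions you have installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed snippet (not the real page): a stray </p> with no opening tag.
BROKEN_HTML = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree}; None marks a parser that is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, tree in parse_with_each(BROKEN_HTML).items():
        print(name, "->", tree if tree is not None else "(not installed)")
```

If the trees differ for a snippet this small, it is not surprising that a long, messy table can come out truncated under one parser and complete under another. Running the same comparison on the downloaded page source is a quick way to check which parser suits it.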

python bs4 scraping a table gives wrong results
