Question

I have some code that I use to scrape web pages. The code is as follows:

for pages in pagesToScrape:
    print('test')
    url = 'http://myurl.com' + str(pages)
    page = pandas.read_html(url, attrs={'class': 'tableToRead'}, header = 0)  # Scrape web page
    print('hi')  # This is never printed for some reason

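As noted in the comment, for some reason no code below the pandas.read_html line is ever executed, yet I don't get any error message either. This code worked 2 months ago, so I'm wondering whether something changed in lxml, BeautifulSoup4, or one of their dependencies, since the web page itself hasn't changed at all. I have also verified that the URL I'm using is valid. As a further test, I also tried: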

for pages in pagesToScrape:
    print('test')
    url = 'http://myurl.com' + str(pages)
    page = pandas.read_html(url, attrs={'class': 'tableToRead'}, header = 0)  # Scrape web page
    print(page)  # Doesn't print anything
    fasidfoaisdf()  # This non-existent function never throws an error either...

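Does anyone have any idea why this happens? I would expect at least the call to the non-existent function to raise an error, but the program runs without complaint and even prints 'test' on every iteration of the for loop.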

Python v3.5.3

BeautifulSoup4 v4.6.0

bs4 v0.0.1

lxml v3.7.3

Edit: I also tried removing 'header = 0' from the read_html call, and nothing changed.

Answer 1

In case anyone else runs into this problem, I changed my code to:

for pages in pagesToScrape:
    print('test')
    url = 'http://myurl.com' + str(pages)
    try:
        page = pandas.read_html(url, attrs={'class': 'tableToRead'}, header = 0)  # Scrape web page
    except Exception as e:
        print(e)  # Show the error that was previously hidden
    print('hi')  # Now reached, because the exception is caught above

With that change I got a proper error message, which in my case said that html5lib was not installed. Thanks, MaxU!
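For anyone who wants to check for this kind of missing dependency before running the scraper, here is a minimal sketch that uses only the standard library. The helper name `missing_modules` and the `PARSER_MODULES` tuple are made up for illustration; the list of modules is an assumption based on the packages mentioned in this thread, and which of them you actually need depends on the `flavor` that read_html ends up using.

```python
import importlib.util

# Assumption: these are the parser-related packages discussed in this thread.
# Which ones read_html really needs depends on its `flavor` argument.
PARSER_MODULES = ("lxml", "bs4", "html5lib")


def missing_modules(names=PARSER_MODULES):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed parser modules are importable.")
```

If html5lib shows up in the output, installing it (for example with pip) should bring back the behaviour described above, at least for the case in this thread.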

Bizarre pandas.read_html error
