BeautifulSoup instantiation timing out?

Date: 2016-06-30 18:18:32

Tags: python web-scraping beautifulsoup

I'm using BeautifulSoup for web scraping and I've run into a strange error. The code:

import urllib2
from bs4 import BeautifulSoup

print "Running urllib2"
g = urllib2.urlopen(link + "about", timeout=5)
print "Finished urllib2"
about_soup = BeautifulSoup(g, 'lxml')

Here is the output:

Running urllib2
Finished urllib2

Error
    Traceback (most recent call last):
      File "/Users/pspieker/Documents/projects/ThePyStrikesBack/tests/TestSpringerOpenScraper.py", line 10, in test_strip_chars
        for row in self.instance.get_entries():
      File "/Users/pspieker/Documents/projects/ThePyStrikesBack/src/JournalScrapers.py", line 304, in get_entries
        about_soup = BeautifulSoup(g, 'lxml')
      File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
        markup = markup.read()
      File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 355, in read
        data = self._sock.recv(rbufsize)
      File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 588, in read
        return self._read_chunked(amt)
      File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 648, in _read_chunked
        value.append(self._safe_read(amt))
      File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 703, in _safe_read
        chunk = self.fp.read(min(amt, MAXAMOUNT))
      File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 384, in read
        data = self._sock.recv(left)
    timeout: timed out

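I understand that urllib2.urlopen could cause a problem like this, but the exception is raised on the line that instantiates BeautifulSoup. I did some googling but couldn't find anything about timeout issues in BeautifulSoup.

Any ideas about what is going on?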

1 answer:

Answer 0 (score: 2)

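It is the urllib2 part that causes the timeout.

The failure shows up on the BeautifulSoup instantiation line because BeautifulSoup internally reads g, the file-like object returned by urlopen. The response body is only pulled from the socket at that point, and the timeout passed to urlopen still applies to that read. This part of the stack trace shows it: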

File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
    markup = markup.read()
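
One way to make this visible is to read the response yourself before handing it to BeautifulSoup, so the timeout is raised (and can be handled) on your own line. Below is a minimal sketch, not a definitive fix. It assumes Python 3, where urllib.request takes the place of urllib2; read_body and its opener parameter are hypothetical names introduced only for illustration.

```python
import socket
import urllib.request


def read_body(url, timeout=5, opener=urllib.request.urlopen):
    """Open url and read the whole body so a read timeout surfaces here.

    Returns the body as bytes, or None if reading the body timed out.
    `opener` is a hypothetical hook that lets a test replace urlopen.
    """
    response = opener(url, timeout=timeout)
    try:
        # The timeout given to urlopen also applies to later socket
        # reads, so this call can raise socket.timeout as well.
        return response.read()
    except socket.timeout:
        return None
    finally:
        response.close()
```

The returned bytes can then be passed on, for example BeautifulSoup(body, 'lxml') when body is not None. What to do on a timeout (retry, use a longer timeout, skip the page) depends on the scraper and is left open here.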