我正在使用BeautifulSoup进行网页抓取,我遇到了一个奇怪的错误。代码:
print "Running urllib2"
g = urllib2.urlopen(link + "about", timeout=5)
print "Finished urllib2"
about_soup = BeautifulSoup(g, 'lxml')
这是输出:
Running urllib2
Finished urllib2
Error
Traceback (most recent call last):
File "/Users/pspieker/Documents/projects/ThePyStrikesBack/tests/TestSpringerOpenScraper.py", line 10, in test_strip_chars
for row in self.instance.get_entries():
File "/Users/pspieker/Documents/projects/ThePyStrikesBack/src/JournalScrapers.py", line 304, in get_entries
about_soup = BeautifulSoup(g, 'lxml')
File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
markup = markup.read()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 355, in read
data = self._sock.recv(rbufsize)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 588, in read
return self._read_chunked(amt)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 648, in _read_chunked
value.append(self._safe_read(amt))
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 703, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 384, in read
data = self._sock.recv(left)
timeout: timed out
我理解urllib2.urlopen
可能导致问题,但异常发生在实例化BeautifulSoup的行中。我做了一些谷歌搜索,但无法找到有关BeautfiulSoup
超时问题的任何内容。
有关正在发生的事情的任何想法?
答案 0 :(得分:2)
这是导致超时的urllib2
部分。
您认为BeautifulSoup
实例化行失败的原因是内部g
正在读取BeautifulSoup
,the file-like object, 。这是stacktrace的一部分证明:
File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
markup = markup.read()