When I try to fetch a page with urllib2, I don't get the whole page.
Here is the Python code:
import urllib2
import urllib
import socket
from bs4 import BeautifulSoup

# set the default timeout (in seconds) for HTTP requests
socket.setdefaulttimeout(5)

# getting the page
def get_page(url):
    """Loads a webpage into a string."""
    src = ''
    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
        src = response.read()
        response.close()
    except IOError:
        print "can't open", url
        return src
    return src

def write_to_file(soup):
    """I know that I should use try/except here."""
    # write to a file so you can check whether you got the full page
    file = open('output', 'w')
    file.write(str(soup))
    file.close()

if __name__ == "__main__":
    # this is the page that I'm trying to get
    url = 'http://www.imdb.com/title/tt0118799/'
    src = get_page(url)
    soup = BeautifulSoup(src)
    write_to_file(soup)  # open the file and see what you get
    print "end"
I have spent a whole week trying to find the problem! Why am I not getting the whole page?
Thanks for your help.
Answer 0 (score: 2)
You may need to call read multiple times, for as long as it does not return an empty string, which indicates EOF:
def get_page(url):
    """Loads a webpage into a string."""
    src = ''
    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
        chunk = True
        # keep reading until read() returns an empty string (EOF)
        while chunk:
            chunk = response.read(1024)
            src += chunk
        response.close()
    except IOError:
        print "can't open", url
        return src
    return src
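
The same read-until-empty idea can be tried without touching the network. The following is only a sketch: read_all is a hypothetical helper name, and io.BytesIO stands in for the HTTP response, on the assumption that the real response object exposes the same read(n) method.

```python
import io

def read_all(stream, chunk_size=1024):
    """Read a file-like object chunk by chunk until EOF; return all bytes."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty result signals EOF
            break
        chunks.append(chunk)
    # joining once avoids building many intermediate strings
    return b''.join(chunks)

# Stand-in for an HTTP response (assumption): any object with read(n) works
fake_response = io.BytesIO(b'x' * 5000)
data = read_all(fake_response)
```

Collecting the chunks in a list and joining them at the end is a design choice that avoids repeated string concatenation for large pages; the behavior is otherwise the same as the loop above.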
Answer 1 (score: 2)
I had the same problem. I thought it was urllib, but it was bs4.
Instead of using
BeautifulSoup(src)
or
soup = bs4.BeautifulSoup(html, 'html.parser')
try using
soup = bs4.BeautifulSoup(html, 'html5lib')
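
To tell whether the download or the parser is losing content, one option is to inspect the raw string before it reaches BeautifulSoup. This is a rough heuristic rather than a definitive check: looks_complete is a hypothetical helper, and it assumes the page ends with a closing </html> tag within its last 200 characters.

```python
def looks_complete(html_text, tail_size=200):
    """Heuristic: True if a closing </html> tag appears near the end."""
    if isinstance(html_text, bytes):
        # response.read() returns bytes; decode leniently for the check
        html_text = html_text.decode('utf-8', 'replace')
    tail = html_text[-tail_size:].lower()
    return '</html>' in tail

# Small hand-written samples (not real page data)
full_page = '<html><body><p>hi</p></body></html>\n'
cut_page = '<html><body><p>hi'

full_ok = looks_complete(full_page)
cut_ok = looks_complete(cut_page)
```

If looks_complete(src) is true for the downloaded string but the written output file is still cut short, the parser is the more likely culprit, and switching to 'html5lib' is worth trying. Note that html5lib is a separate package and has to be installed before bs4 can use it.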