Python scraping with urllib2: HTTP errors

Date: 2016-03-29 21:04:07

Tags: python urllib2 scrape scraper http-error

I'm trying to scrape a website, but my code only works if I first open the site in a browser and then refresh it. I've tried several approaches and keep hitting the following two errors. The first: `HTTPError: HTTP Error 416: Requested Range Not Satisfiable`

import json
import urllib2

urlslist = open("list_urls.txt").read()
urlslist = urlslist.split("\n")
for url in urlslist:
    htmltext = urllib2.urlopen("www..." + url)
    data = json.load(htmltext)
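Independent of the HTTP error, the loop above has two small bugs: `for urlslist in urlslist` reuses the list's own name as the loop variable, and splitting the file on `"\n"` leaves empty strings for blank lines, so the loop also tries to fetch empty URLs. A minimal sketch of a cleanup step (`load_urls` is a hypothetical helper name, not from the question):

```python
def load_urls(text):
    """Split a newline-separated URL list, dropping blank lines
    and surrounding whitespace so no empty URL is fetched."""
    return [line.strip() for line in text.split("\n") if line.strip()]

# A file read with .read() usually ends in a newline; without the
# filter above, that trailing newline would yield an empty URL.
urls = load_urls("example.com\nanother.org\n")
```

The same list comprehension can be dropped in place of the bare `split("\n")` before the `for` loop.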

I also tried adding some headers and the like, but then I get the error `ValueError: No JSON object could be decoded`:

req = urllib2.Request('https://www...')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')

htmltext = urllib2.urlopen(req)
data = json.load(htmltext)
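The `No JSON object could be decoded` error means the response body handed to `json.load` was not valid JSON, which typically happens when the server returns an HTML error page instead of the expected data. One defensive pattern is to parse the body and fall back gracefully (`parse_json_or_none` is an illustrative helper, not part of the question's code):

```python
import json

def parse_json_or_none(body):
    """Try to parse a response body as JSON; return None when the
    body is not JSON (e.g. the server sent back an HTML page)."""
    try:
        return json.loads(body)
    except ValueError:  # json raises ValueError on non-JSON input
        return None

good = parse_json_or_none('{"event": "concert", "price": 42}')
bad = parse_json_or_none("<html>blocked</html>")
```

Logging the raw body when parsing fails usually makes it obvious whether the site is blocking the request or simply serving HTML at that URL.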

I'm stuck. Any help?

1 Answer:

Answer 0 (score: -1)

When you request a URL, you need to include the "http(s)://" part as well. Assuming your text file contains only the "name.com" part of each URL (e.g. google.com instead of https://www.google.com), this is the code you need:

htmltext = urllib2.urlopen("https://www." + urlslist)

If the URL is the stubhub.com one (as you mentioned in your comment), you don't need the "s". It would be this instead:

htmltext = urllib2.urlopen("http://www." + urlslist)

The JSON error may simply mean there is no JSON in the response to load. Take a look at the browser's developer panel (Network tab) and confirm that the endpoint you're requesting actually returns JSON rather than an HTML page.
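One way to act on that advice in code, rather than only in the browser, is to check the response's Content-Type header before attempting `json.load`. This is an illustrative sketch (the `looks_like_json` helper is mine, not from the answer); with urllib2, the header value would come from `response.info().get("Content-Type", "")` on the object `urlopen` returns:

```python
def looks_like_json(content_type):
    """Heuristic check of a Content-Type header value before
    trying to decode the response body as JSON."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("application/json", "text/json")

json_header = looks_like_json("application/json; charset=utf-8")
html_header = looks_like_json("text/html; charset=utf-8")
```

If the header says `text/html`, the scraper is getting a page (or a block screen) instead of the API payload, which matches the refresh-dependent behaviour described in the question.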