I'm trying to use BeautifulSoup to scrape articles from a website. I keep getting 'HTTP Error 403: Forbidden' as output. Could someone explain how to get around this? Here is my code:
url: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03
timestamp = datetime.date.today()
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Check if article is from Magharebia.com
# remaining issues: error 403: forbidden. Possible robots.txt?
# Can't scrape anything atm
if "magharebia.com" in url:
    # Create a new file to write content to
    #txt = open('%s.txt' % timestamp, "wb")
    # Parse HTML of article, aka making soup
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    # Write the article title to the file
    try:
        title = soup.find("h2")
        txt.write('\n' + "Title: " + str(title) + '\n' + '\n')
    except:
        print "Could not find the title!"
    # Author/Location/Date
    try:
        artinfo = soup.find("h4").text
        txt.write("Author/Location/Date: " + str(artinfo) + '\n' + '\n')
    except:
        print "Could not find the article info!"
    # Retrieve all of the paragraphs
    tags = soup.find("div", {'class': 'body en_GB'}).find_all('p')
    for tag in tags:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')
    # Close txt file with new content added
    txt.close()
Please enter a valid URL: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03
Traceback (most recent call last):
  File "idle_test.py", line 18, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url).read())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Answer 0 (score: 3)
I was able to reproduce the 403 Forbidden error with urllib2. I didn't dig into the cause, but the following worked for me:
import requests
from bs4 import BeautifulSoup
url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"
soup = BeautifulSoup(requests.get(url).text)
print soup # prints the HTML you are expecting
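If you would rather stay with urllib2, one thing worth trying is a different User-Agent header. By default urllib2 identifies itself as Python-urllib/2.7, and some servers reject that string with a 403. I have not confirmed that this is what this site does, so treat the following as a minimal sketch under that assumption. The names build_request, fetch_html, and BROWSER_USER_AGENT are made up for illustration, and the header value is just an example.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2, as used in the question

# Assumption: any browser-like string; this exact value is only an example.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that sends user_agent instead of the Python-urllib default."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download url with the custom User-Agent and return the raw response body."""
    response = urlopen(build_request(url))
    try:
        return response.read()
    finally:
        response.close()
```

In the question's code, BeautifulSoup(fetch_html(url)) would then take the place of BeautifulSoup(urllib2.urlopen(url).read()). If the server still answers 403, the block is based on something other than the User-Agent header and this will not help.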