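I am trying to scrape links from a list of links (all pointing to different pages on the same website), but I keep running into a 403 error. Here are examples of the links I am trying to scrape:
https://www.spectatornews.com/page/6/?s=band
https://www.spectatornews.com/page/7/?s=band
etc.
Here is my code: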
from bs4 import BeautifulSoup
import urllib.request

getarticles = []
for i in listoflinks:
    resp = urllib.request.urlopen(i)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
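I have been trying some of the answers from "HTTP error 403 in Python 3 Web Scraping", but without much success. I am not sure whether I am applying them correctly to my whole list of links. I tried one of the solutions there, which sets a request header, but it returns HTTP Error 406: Not Acceptable.
Here is the code with my attempted fix: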
from bs4 import BeautifulSoup
import urllib.request

getarticles = []
for i in listoflinks:
    # Send a User-Agent header instead of urllib's default one
    req = urllib.request.Request(i, headers={'User-Agent': 'Mozilla/5.0'})
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
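Any help would be greatly appreciated. I am very new to this, so the more explanation you can give, the better. I just want to collect the links from my list of pages!
Thanks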
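Answer 0: (score: 0)
403 Forbidden: The server understood the request but refuses to authorize it.
406 Not Acceptable: The target resource does not have a current representation that would be acceptable to the user agent, according to the proactive negotiation header fields received in the request, and the server is unwilling to supply a default representation.
Your user agent may be the problem. I was able to get output by changing it to a full browser user-agent string: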
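To see which of these statuses a given set of headers actually produces, one option is to catch urllib.error.HTTPError and read its code. The following is only a minimal sketch: fetch_status is a hypothetical helper name, not something from the question, and the 10-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def fetch_status(url, headers=None, timeout=10):
    """Return the HTTP status code for url, including error codes such as 403 or 406.

    fetch_status is a hypothetical helper written for this illustration.
    """
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # urlopen raises HTTPError for 4xx/5xx responses; the exception
        # carries the status code the server sent back.
        return e.code
```

Calling it once with no headers and once with each candidate User-Agent string should show which variant the server accepts, without having to parse any page.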
from bs4 import BeautifulSoup
import urllib.request

listoflinks = ['https://www.spectatornews.com/page/6/?s=band', 'https://www.spectatornews.com/page/6/?s=band']
getarticles = []
for i in listoflinks:
    # Use a full browser-style User-Agent string rather than the bare 'Mozilla/5.0'
    req = urllib.request.Request(
        i,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'), features="lxml")
    # Collect the href of every <a> tag that has one
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
print(getarticles)
Output
['https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/about/', 'https://www.spectatornews.com/about/editorial-policy/', 'https://www.spectatornews.com/about/correction-policy/', 'https://www.spectatornews.com/about/bylaws/', 'https://www.spectatornews.com/advertise/', 'https://www.spectatornews.com/contact/', 'https://www.spectatornews.com/staff/', 'https://www.spectatornews.com/submit-a-letter/', 'https://www.spectatornews.com/submit-a-news-tip/', '/', 'https://www.spectatornews.com', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 'http://facebook.com/spectatornews', 'https://www.spectatornews.com/campus-news/2004/05/06/english-fest-draws-speakers-bands/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/campus-news/2004/05/03/burgers-on-the-grill-bands-on-the-scene/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2004/04/29/hempfest-celebrates-its-10th-year-with-11-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/29/pat-mcgee-band-rocks-mad-town/', 'https://www.spectatornews.com/category/showcase/', 
'https://www.spectatornews.com/showcase/2004/04/22/leinenkugels-battle-of-the-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/08/on-the-music-scene-band-makes-mondays-better/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/03/18/on-the-music-scene-band-carries-on-duluozs-work/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/10/09/jamband-grooving-to-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/05/01/joepalooza-set-with-5-bands-one-drummer/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/campus-news/2003/05/01/hempfest-features-nine-bands/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2003/02/17/houston-based-band-reaching-out-to-college-students-on-tour/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/?s=band', 'https://www.spectatornews.com/page/2/?s=band', 'https://www.spectatornews.com/page/3/?s=band', 'https://www.spectatornews.com/page/4/?s=band', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com/page/8/?s=band', 'https://www.spectatornews.com/page/9/?s=band', 'https://www.spectatornews.com/page/127/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 
'http://facebook.com/spectatornews', '/', 'https://snosites.com/why-sno/', 'http://snosites.com', 'https://www.spectatornews.com/wp-login.php', '#top', '/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', 'https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/about/', 'https://www.spectatornews.com/about/editorial-policy/', 'https://www.spectatornews.com/about/correction-policy/', 'https://www.spectatornews.com/about/bylaws/', 'https://www.spectatornews.com/advertise/', 'https://www.spectatornews.com/contact/', 'https://www.spectatornews.com/staff/', 'https://www.spectatornews.com/submit-a-letter/', 'https://www.spectatornews.com/submit-a-news-tip/', '/', 'https://www.spectatornews.com', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/', '/', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 'http://facebook.com/spectatornews', 'https://www.spectatornews.com/campus-news/2004/05/06/english-fest-draws-speakers-bands/', 'https://www.spectatornews.com/category/campus-news/', 
'https://www.spectatornews.com/campus-news/2004/05/03/burgers-on-the-grill-bands-on-the-scene/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2004/04/29/hempfest-celebrates-its-10th-year-with-11-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/29/pat-mcgee-band-rocks-mad-town/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/22/leinenkugels-battle-of-the-bands/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/04/08/on-the-music-scene-band-makes-mondays-better/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2004/03/18/on-the-music-scene-band-carries-on-duluozs-work/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/10/09/jamband-grooving-to-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/05/01/joepalooza-set-with-5-bands-one-drummer/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/campus-news/2003/05/01/hempfest-features-nine-bands/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/showcase/2003/02/17/houston-based-band-reaching-out-to-college-students-on-tour/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/', 'https://www.spectatornews.com/category/showcase/', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/?s=band', 'https://www.spectatornews.com/page/2/?s=band', 'https://www.spectatornews.com/page/3/?s=band', 'https://www.spectatornews.com/page/4/?s=band', 'https://www.spectatornews.com/page/5/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com/page/8/?s=band', 
'https://www.spectatornews.com/page/9/?s=band', 'https://www.spectatornews.com/page/127/?s=band', 'https://www.spectatornews.com/page/7/?s=band', 'https://www.spectatornews.com', 'https://www.spectatornews.com/feed/rss/', '#', 'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ', 'https://www.snapchat.com/add/spectator news', 'https://www.instagram.com/spectatornews/', 'http://twitter.com/spectatornews', 'http://facebook.com/spectatornews', '/', 'https://snosites.com/why-sno/', 'http://snosites.com', 'https://www.spectatornews.com/wp-login.php', '#top', '/', 'https://www.spectatornews.com/category/campus-news/', 'https://www.spectatornews.com/category/currents/', 'https://www.spectatornews.com/category/sports/', 'https://www.spectatornews.com/category/opinion/', 'https://www.spectatornews.com/category/multimedia-2/']
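The output above mixes absolute URLs with relative entries such as '/' and in-page anchors such as '#' and '#top', and many links repeat. If a cleaner list is wanted, a small post-processing step along these lines might help. This is only a sketch: clean_links is a hypothetical helper name, and using the URL of the fetched page as the base for relative links is an assumption.

```python
from urllib.parse import urldefrag, urljoin


def clean_links(hrefs, base_url):
    """Resolve relative hrefs against base_url, drop in-page anchors and duplicates.

    clean_links is a hypothetical helper; order of first appearance is kept.
    """
    seen = set()
    cleaned = []
    for href in hrefs:
        if href.startswith('#'):
            # '#' and '#top' only point back into the current page
            continue
        # Make the link absolute, then strip any '#fragment' suffix
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned
```

It could be applied once at the end, for example as clean_links(getarticles, 'https://www.spectatornews.com/'), since all pages in the list belong to the same site.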
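Edit to handle 404 errors:
Some links in the list may be unavailable. One option is to use a try-except block to handle those and carry on with the remaining links.
So the final code would be: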
from bs4 import BeautifulSoup
import urllib.error
import urllib.request

listoflinks = ['https://www.spectatornews.com/page/6/?s=band', 'https://www.spectatornews.com/page/6/?s=band', 'https://www.spectatornews.com/page/100099?s=band', 'http://sdfgsdjhgfjsgdhfgsj.com']
getarticles = []
for i in listoflinks:
    req = urllib.request.Request(
        i,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print("Unavailable link", i, " skipping---")
            continue  # nothing to parse for this link, move on to the next one
        raise  # any other HTTP error is still reported
    except urllib.error.URLError:
        # Raised when the host cannot be reached at all, e.g. a domain that does not resolve
        print("Unreachable link", i, " skipping---")
        continue
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'), features="lxml")
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
print(getarticles)
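Answer 1: (score: 0)
I'll say up front that I have rarely used the urllib/3 libraries. However, I did try both scrapy's shell command and the requests library without a user agent, and got a 200 response each time.
I did notice that you do not declare a parser type when you create the soup: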
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
I prefer scrapy's parser myself, even though it is heavier, but if I remember correctly you have to declare a parser type, for example:
soup = BeautifulSoup(resp, "lxml")
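If installing lxml is not convenient, the same kind of link extraction can also be sketched with only the standard library's html.parser module. This is an illustration rather than a replacement for the Beautiful Soup code above: LinkCollector and extract_links are hypothetical names, and the sketch expects the page as an already decoded string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.links.append(value)


def extract_links(html_text):
    """Return all href values found in <a> tags of html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

With urllib this would be called on the decoded body, for instance extract_links(resp.read().decode('utf-8', errors='replace')), where assuming UTF-8 is itself a guess about the site's encoding.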
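Bitto Benni-chan said he managed to get a 200 response with urllib.request, so try making that change. He just supplied the full user-agent string.
My own suggestion is to use the requests library. I think it would be a simple enough change.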
from bs4 import BeautifulSoup
import requests

listoflinks = ['https://www.spectatornews.com/page/6/?s=band', 'https://www.spectatornews.com/page/7/?s=band']
getarticles = []
for i in listoflinks:
    resp = requests.get(i)
    # Pass the raw bytes and name the parser explicitly
    soup = BeautifulSoup(resp.content, "lxml")
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
The getarticles list output looks like this:
'https://www.spectatornews.com/category/showcase/',
'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/',
'https://www.spectatornews.com/category/showcase/',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/?s=band',
'https://www.spectatornews.com/page/2/?s=band',
'https://www.spectatornews.com/page/3/?s=band',
'https://www.spectatornews.com/page/4/?s=band',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/page/7/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com/page/9/?s=band',
'https://www.spectatornews.com/page/127/?s=band',
'https://www.spectatornews.com/page/7/?s=band',
'https://www.spectatornews.com',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'/',
'https://snosites.com/why-sno/',
'http://snosites.com',
'https://www.spectatornews.com/wp-login.php',
'#top',
'/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'/',
'https://www.spectatornews.com/about/',
'https://www.spectatornews.com/about/editorial-policy/',
'https://www.spectatornews.com/about/correction-policy/',
'https://www.spectatornews.com/about/bylaws/',
'https://www.spectatornews.com/advertise/',
'https://www.spectatornews.com/contact/',
'https://www.spectatornews.com/staff/',
'https://www.spectatornews.com/submit-a-letter/',
'https://www.spectatornews.com/submit-a-news-tip/',
'/',
'https://www.spectatornews.com',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'/',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'https://www.spectatornews.com/campus-news/2002/05/09/late-night-bus-service-idea-abandoned-due-to-expense/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/opinion/2002/03/21/yates-deserved-what-she-got-husband-also-to-blame/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/opinion/2001/11/29/air-force-concert-band-inspires-zorn-arena-audience/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/campus-news/2001/10/25/goth-style-bands-will-entertain-at-halloween-costume-concert/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/campus-news/2001/04/19/campus-group-will-host-hemp-event-with-bands-information/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
'https://www.spectatornews.com/staff/?writer=Alanna%20Huggett',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/geekcon/',
'https://www.spectatornews.com/tag/tv10/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
'https://www.spectatornews.com/staff/?writer=Kar%20Wei%20Cheng',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/tag/accessories/',
'https://www.spectatornews.com/tag/fashion/',
'https://www.spectatornews.com/tag/multimedia/',
'https://www.spectatornews.com/tag/winter/',
'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
'https://www.spectatornews.com/staff/?writer=Julia%20Van%20Allen',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/tag/dancing/',
'https://www.spectatornews.com/tag/harry-potter/',
'https://www.spectatornews.com/tag/smom/',
'https://www.spectatornews.com/tag/student-ministry-of-magic/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/tag/yule/',
'https://www.spectatornews.com/tag/yule-ball/',
'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
'https://www.spectatornews.com/staff/?writer=Madeline%20Fuerstenberg',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/tag/1950/',
'https://www.spectatornews.com/tag/1975/',
'https://www.spectatornews.com/tag/2000/',
'https://www.spectatornews.com/tag/articles/',
'https://www.spectatornews.com/tag/spectator/',
'https://www.spectatornews.com/tag/throwback/',
'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
'https://www.spectatornews.com/staff/?writer=Taylor%20Reisdorf',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/altoona/',
'https://www.spectatornews.com/tag/boss-women/',
'https://www.spectatornews.com/tag/business-women/',
'https://www.spectatornews.com/tag/cherish-woodford/',
'https://www.spectatornews.com/tag/crossfit/',
'https://www.spectatornews.com/tag/crossfit-river-prairie/',
'https://www.spectatornews.com/tag/eau-claire/',
'https://www.spectatornews.com/tag/fitness/',
'https://www.spectatornews.com/tag/gym/',
'https://www.spectatornews.com/tag/local/',
'https://www.spectatornews.com/tag/nicole-randall/',
'https://www.spectatornews.com/tag/river-prairie/',
'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
'https://www.spectatornews.com/staff/?writer=Lea%20Kopke',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/bad-art/',
'https://www.spectatornews.com/tag/fmdown/',
'https://www.spectatornews.com/tag/ghosts-of-the-sun/',
'https://www.spectatornews.com/tag/music/',
'https://www.spectatornews.com/tag/pablo-center/',
'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
'https://www.spectatornews.com/staff/?writer=Stephanie%20Janssen',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/tag/satire/',
'https://www.spectatornews.com/tag/sleepy/',
'https://www.spectatornews.com/tag/tator/',
'https://www.spectatornews.com/tag/uw-eau-claire/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/page/6/?s=band',
'https://www.spectatornews.com/?s=band',
'https://www.spectatornews.com/page/2/?s=band',
'https://www.spectatornews.com/page/3/?s=band',
'https://www.spectatornews.com/page/4/?s=band',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/page/6/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com/page/9/?s=band',
'https://www.spectatornews.com/page/10/?s=band',
'https://www.spectatornews.com/page/127/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'/',
'https://snosites.com/why-sno/',
'http://snosites.com',
'https://www.spectatornews.com/wp-login.php',
'#top',
'/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/']