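I'm getting duplicate links for the links I want to collect, and I don't know why. I'm also trying to collect the links from every page, not just the first one, but I don't know how to write code that clicks through to the next page. Can someone help me understand how to do this?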
import requests
from bs4 import BeautifulSoup

url = 'http://www.gosugamers.net/counterstrike/teams'
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page)
#all_teams = []
for team_links in soup.find_all('a', href=True):
    if team_links['href'] == '' or team_links['href'].startswith('/counterstrike/teams'):
        print (team_links.get('href').replace('/counterstrike/teams', url))
Answer 0 (score: 3)
The team links are in the anchor tags inside the h3 tags, which are inside the div with the class details:
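import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base = "http://www.gosugamers.net"
url = 'http://www.gosugamers.net/counterstrike/teams'
r = requests.get(url)
page = r.text
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(page, "html.parser")
# Select only the heading anchors, then make each href absolute.
for team_links in soup.select("div.details h3 a"):
    print(urljoin(base, team_links["href"]))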
This gives you:
http://www.gosugamers.net/counterstrike/teams/16338-motv
http://www.gosugamers.net/counterstrike/teams/16337-absolute-monster
http://www.gosugamers.net/counterstrike/teams/16258-immortals-cs
http://www.gosugamers.net/counterstrike/teams/16251-ireal-star-gaming
http://www.gosugamers.net/counterstrike/teams/16176-team-genesis
http://www.gosugamers.net/counterstrike/teams/16175-potadies
http://www.gosugamers.net/counterstrike/teams/16174-crowns-gg
http://www.gosugamers.net/counterstrike/teams/16173-visomvet
http://www.gosugamers.net/counterstrike/teams/16172-team-phenomenon
http://www.gosugamers.net/counterstrike/teams/16152-kriklekrakle
http://www.gosugamers.net/counterstrike/teams/16148-begenius
http://www.gosugamers.net/counterstrike/teams/16144-blubblub
http://www.gosugamers.net/counterstrike/teams/16142-team-1231
http://www.gosugamers.net/counterstrike/teams/16141-vsv
http://www.gosugamers.net/counterstrike/teams/16140-tbi
http://www.gosugamers.net/counterstrike/teams/16136-deadweight
http://www.gosugamers.net/counterstrike/teams/16135-me-myself-and-i
http://www.gosugamers.net/counterstrike/teams/16085-pur-esports
http://www.gosugamers.net/counterstrike/teams/15850-falken
http://www.gosugamers.net/counterstrike/teams/15815-team-abyssal
You are parsing every link on the page, which is why you see duplicates.
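To make the difference concrete, here is a minimal offline sketch. The HTML is invented for illustration and is not the real gosugamers markup; it only assumes that each team is linked twice, once from a logo and once from the heading inside `div.details`. The team names and image paths are hypothetical.

```python
from urllib.parse import urljoin  # Python 3

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example: each team is linked twice,
# once from a logo image and once from the heading inside div.details.
SAMPLE_HTML = """
<div class="team">
  <a href="/counterstrike/teams/1-alpha"><img src="a.png"></a>
  <div class="details"><h3><a href="/counterstrike/teams/1-alpha">Alpha</a></h3></div>
</div>
<div class="team">
  <a href="/counterstrike/teams/2-bravo"><img src="b.png"></a>
  <div class="details"><h3><a href="/counterstrike/teams/2-bravo">Bravo</a></h3></div>
</div>
"""

base = "http://www.gosugamers.net"
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every anchor on the page: each team shows up twice.
every_link = [urljoin(base, a["href"]) for a in soup.find_all("a", href=True)]

# Only the heading anchors: each team shows up once.
team_links = [urljoin(base, a["href"]) for a in soup.select("div.details h3 a")]

print(len(every_link), len(team_links))
```

On this sample the broad search returns twice as many URLs as the narrow selector, which is the same kind of duplication described in the question.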
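To get all the teams, we can keep following the next-page link until the span with the text "Next" is no longer there, which only happens on the last page: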
def get_all(url, base):
    # First page: yield its team links, then look for the "Next" span.
    r = requests.get(url)
    page = r.text
    soup = BeautifulSoup(page, "html.parser")
    for team_links in soup.select("div.details h3 a"):
        yield (urljoin(base, team_links["href"]))
    nxt = soup.find("div", {"class": "pages"}).find("span", text="Next")
    # Keep following the link around "Next" until the span disappears.
    while nxt:
        r = requests.get(urljoin(base, nxt.find_previous("a")["href"]))
        page = r.text
        soup = BeautifulSoup(page, "html.parser")
        for team_links in soup.select("div.details h3 a"):
            yield (urljoin(base, team_links["href"]))
        nxt = soup.find("div", {"class": "pages"}).find("span", text="Next")
If we run it for a few seconds, you can see that we get the following pages as well:
In [26]: for link in (get_all(url, base)):
   ....:     print(link)
   ....:
http://www.gosugamers.net/counterstrike/teams/16386-cantonese-cs
http://www.gosugamers.net/counterstrike/teams/16338-motv
http://www.gosugamers.net/counterstrike/teams/16337-absolute-monster
http://www.gosugamers.net/counterstrike/teams/16258-immortals-cs
http://www.gosugamers.net/counterstrike/teams/16251-ireal-star-gaming
http://www.gosugamers.net/counterstrike/teams/16176-team-genesis
http://www.gosugamers.net/counterstrike/teams/16175-potadies
http://www.gosugamers.net/counterstrike/teams/16174-crowns-gg
http://www.gosugamers.net/counterstrike/teams/16173-visomvet
http://www.gosugamers.net/counterstrike/teams/16172-team-phenomenon
http://www.gosugamers.net/counterstrike/teams/16152-kriklekrakle
http://www.gosugamers.net/counterstrike/teams/16148-begenius
http://www.gosugamers.net/counterstrike/teams/16144-blubblub
http://www.gosugamers.net/counterstrike/teams/16142-team-1231
http://www.gosugamers.net/counterstrike/teams/16141-vsv
http://www.gosugamers.net/counterstrike/teams/16140-tbi
http://www.gosugamers.net/counterstrike/teams/16136-deadweight
http://www.gosugamers.net/counterstrike/teams/16135-me-myself-and-i
http://www.gosugamers.net/counterstrike/teams/16085-pur-esports
http://www.gosugamers.net/counterstrike/teams/15850-falken
http://www.gosugamers.net/counterstrike/teams/15815-team-abyssal
http://www.gosugamers.net/counterstrike/teams/15810-ex-deathtrap
http://www.gosugamers.net/counterstrike/teams/15808-mix123
http://www.gosugamers.net/counterstrike/teams/15651-undertake-esports
http://www.gosugamers.net/counterstrike/teams/15644-five
http://www.gosugamers.net/counterstrike/teams/15630-five
http://www.gosugamers.net/counterstrike/teams/15627-inetkoxtv
http://www.gosugamers.net/counterstrike/teams/15626-tetr-s
http://www.gosugamers.net/counterstrike/teams/15625-rozenoir-esports-white
http://www.gosugamers.net/counterstrike/teams/15619-fragment-gg
http://www.gosugamers.net/counterstrike/teams/15615-monarchs-gg
http://www.gosugamers.net/counterstrike/teams/15602-ottoman-fire
http://www.gosugamers.net/counterstrike/teams/15591-respect
http://www.gosugamers.net/counterstrike/teams/15569-moonbeam-gaming
http://www.gosugamers.net/counterstrike/teams/15563-team-tilt
http://www.gosugamers.net/counterstrike/teams/15534-dynasty-uk
http://www.gosugamers.net/counterstrike/teams/15507-urbantech
http://www.gosugamers.net/counterstrike/teams/15374-innova
http://www.gosugamers.net/counterstrike/teams/15373-g3x
http://www.gosugamers.net/counterstrike/teams/15372-cnb
http://www.gosugamers.net/counterstrike/teams/15370-intz
http://www.gosugamers.net/counterstrike/teams/15369-2kill
http://www.gosugamers.net/counterstrike/teams/15368-supernova
http://www.gosugamers.net/counterstrike/teams/15367-biggods
http://www.gosugamers.net/counterstrike/teams/15366-playzone
http://www.gosugamers.net/counterstrike/teams/15365-pride
http://www.gosugamers.net/counterstrike/teams/15359-rising-orkam
http://www.gosugamers.net/counterstrike/teams/15342-team-foxez
http://www.gosugamers.net/counterstrike/teams/15336-angels
http://www.gosugamers.net/counterstrike/teams/15331-atlando-esports
http://www.gosugamers.net/counterstrike/teams/15329-xfinity-esports
http://www.gosugamers.net/counterstrike/teams/15326-nano-reapers
http://www.gosugamers.net/counterstrike/teams/15322-erase-team
http://www.gosugamers.net/counterstrike/teams/15318-heyguys
http://www.gosugamers.net/counterstrike/teams/15317-illusory
http://www.gosugamers.net/counterstrike/teams/15285-dismay
http://www.gosugamers.net/counterstrike/teams/15284-kingdom-esports
http://www.gosugamers.net/counterstrike/teams/15283-team-rival
http://www.gosugamers.net/counterstrike/teams/15282-ze-pug-godz
http://www.gosugamers.net/counterstrike/teams/15281-unlimited-potential1
You can see in the source that on the first page, and on every page bar the last, there is a span with the text Next:
When we reach the last page, there are only spans for Previous and First:
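The site's actual pager markup is not reproduced here, so the following is only a sketch of the stop condition under assumed markup: a `div` with class `pages` whose anchors each wrap a span. The two HTML strings, the `?page=` query parameter, and the helper name `next_page_url` are all hypothetical.

```python
from urllib.parse import urljoin  # Python 3

from bs4 import BeautifulSoup

BASE = "http://www.gosugamers.net"

# Hypothetical pager for a middle page: it still has a "Next" span.
MIDDLE_PAGE = """
<div class="pages">
  <a href="/counterstrike/teams?page=1"><span>Previous</span></a>
  <a href="/counterstrike/teams?page=3"><span>Next</span></a>
</div>
"""

# Hypothetical pager for the last page: only "First" and "Previous" remain.
LAST_PAGE = """
<div class="pages">
  <a href="/counterstrike/teams?page=1"><span>First</span></a>
  <a href="/counterstrike/teams?page=2"><span>Previous</span></a>
</div>
"""


def next_page_url(soup, base=BASE):
    """Return the absolute URL of the next page, or None on the last page."""
    pager = soup.find("div", {"class": "pages"})
    if pager is None:
        return None
    # string= is the current spelling; older bs4 releases call it text=.
    nxt = pager.find("span", string="Next")
    if nxt is None:
        return None
    # The anchor that wraps the span precedes it in document order.
    return urljoin(base, nxt.find_previous("a")["href"])


middle = next_page_url(BeautifulSoup(MIDDLE_PAGE, "html.parser"))
last = next_page_url(BeautifulSoup(LAST_PAGE, "html.parser"))
print(middle)
print(last)
```

Returning `None` when either the pager or the Next span is missing gives a loop a single value to test, and avoids an `AttributeError` on a page that has no pager at all.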