So I want to get the name of every player at every Premier League club from transfermarkt. As a test, the page I'm trying is: https://www.transfermarkt.co.uk/ederson/profil/spieler/238223
The XPath I found is:
//*[@id="main"]/div[10]/div[1]/div[2]/div[2]/div[2]/div[2]/table/tbody/tr[1]/td
Keep in mind that, because of the structure of the HTML, I have to use XPath, and I have to loop over all the players in each club, and over all the clubs in the Premier League. I've already collected the team links with this code:
# Create empty lists for the player links
playerLink1 = []
playerLink2 = []
playerLink3 = []

# For each team link page...
for i in range(len(Full_Links)):
    # ...download the team page and parse the HTML...
    squadPage = requests.get(Full_Links[i], headers=headers)
    squadSoup = BeautifulSoup(squadPage.text, 'html.parser')
    # ...then extract the player profile links.
    playerLocation = squadSoup.find("div", {"class": "responsive-table"}).find_all("a", {"class": "spielprofil_tooltip"})
    for a in playerLocation:
        playerLink1.append(a['href'])

# Remove duplicates while preserving order
for x in playerLink1:
    if x not in playerLink2:
        playerLink2.append(x)

# For each player link, prepend the domain and store the full URL
for j in range(len(playerLink2)):
    playerLink3.append("https://www.transfermarkt.co.uk" + playerLink2[j])
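Since the question insists on XPath (which BeautifulSoup does not evaluate), here is a minimal sketch using lxml, assuming it is installed. The inline HTML string below is a hypothetical stand-in for a downloaded profile page, and the relative XPath keyed on the table's `auflistung` class is generally more robust than the absolute `//*[@id="main"]/div[10]/...` path, which breaks whenever the page layout shifts:

```python
import lxml.html

# Hypothetical stand-in for requests.get(profile_url).text; the real
# transfermarkt page is larger, but the target table looks like this.
html = """
<table class="auflistung">
  <tbody>
    <tr><th>Name:</th><td>Ederson Santana de Moraes</td></tr>
  </tbody>
</table>
"""

tree = lxml.html.fromstring(html)
# Relative XPath: the <td> of the first row of the profile table
cells = tree.xpath('//table[@class="auflistung"]//tr[1]/td')
name = cells[0].text_content().strip()
print(name)
```

The same `tree.xpath(...)` call accepts the absolute XPath from the question as well, so either style can be dropped into the per-player loop.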
The links are stored in a list variable called playerLink3.
How can I do this?
Answer (score: 2)
I'm not sure how to get the name with XPath, but since you've already imported BS4, I wrote some code that grabs the player's name from the URL you posted.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"}
request_page = requests.get("http://www.transfermarkt.co.uk/ederson/profil/spieler/238223", headers=headers)
page_soup = BeautifulSoup(request_page.text, 'html.parser')

# The profile details live in a table with class "auflistung"
player_table = page_soup.find('table', {'class': 'auflistung'})
table_data = player_table.find_all('td')

print('Name: ', table_data[0].text)
print('Date Of Birth: ', table_data[1].text)
print('Place Of Birth: ', table_data[2].text)
This returns the name, date of birth, and place of birth.
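To tie this back to the question, the same table extraction can be run once per entry in playerLink3. A sketch under stated assumptions: the two inline HTML strings below are hypothetical stand-ins for `requests.get(link, headers=headers).text`, and the cell order (name, date of birth, place of birth) matches the answer above:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the downloaded player profile pages;
# in practice each string would come from requests.get(link).text
# for a link in playerLink3.
pages = [
    '<table class="auflistung"><tr><td>Ederson</td><td>17/08/1993</td><td>Osasco</td></tr></table>',
    '<table class="auflistung"><tr><td>Kevin De Bruyne</td><td>28/06/1991</td><td>Drongen</td></tr></table>',
]

players = []
for html in pages:
    soup = BeautifulSoup(html, 'html.parser')
    # Same extraction as the answer: the "auflistung" table holds the details
    cells = soup.find('table', {'class': 'auflistung'}).find_all('td')
    players.append({
        'name': cells[0].get_text(strip=True),
        'date_of_birth': cells[1].get_text(strip=True),
        'place_of_birth': cells[2].get_text(strip=True),
    })

for p in players:
    print(p['name'])
```

Collecting dicts into a list like this also makes it easy to hand the results to pandas or write them to CSV afterwards.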