Scraping content from a page that has no unique classes

Asked: 2020-03-03 16:52:37

Tags: python web-scraping

I am trying to collect different kinds of information for every player in the Premier League from their transfermarkt pages.

The relevant code is:

import requests
from bs4 import BeautifulSoup

# Full_Links (the team page URLs) and headers are assumed to be defined earlier in the script.

# Unique player links, in the order they are first seen
playerLink3_u = []
# For each team link page...
for teamLink in Full_Links:
    # ...download the team page and parse the HTML...
    squadPage = requests.get(teamLink, headers=headers)
    SquadSoup = BeautifulSoup(squadPage.text, 'html.parser')

    # ...extract the player links...
    playerLocation = SquadSoup.find("div", {"class": "responsive-table"}).find_all("a", {"class": "spielprofil_tooltip"})

    for a in playerLocation:
        # ...save the link, complete with domain, skipping duplicates
        playerLink = "https://www.transfermarkt.co.uk" + a['href']
        if playerLink not in playerLink3_u:
            playerLink3_u.append(playerLink)

# Populate lists with each player

# For each player...
for playerLink in playerLink3_u:
    # ...download and parse the player page...
    playerPage = requests.get(playerLink, headers=headers)
    PlayerSoup = BeautifulSoup(playerPage.text, 'html.parser')

    # ...find the relevant data point for each player, starting with the name...
    tempName = PlayerSoup.find("div", {"class": "spielerdaten "}).find_all("a", {"class": "spielprofil_tooltip"})

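The problem is in the last line, "tempName" (which is wrong): I don't have any class I can use to find the footballer's name.

Here is the link for one player: https://www.transfermarkt.co.uk/ederson/profil/spieler/238223

Any tips on how to extract data from this HTML, given that besides the name I also need to get more data from the same place?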

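2 answers:

Answer 0: (score: 1)

The page is dynamic and is rendered after the initial request. You have to either access the data through an API (if one is available) or use browser automation (such as Selenium) to open the page, let it render, and then pull out the HTML: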

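from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('C:/chromedriver_win32/chromedriver.exe'))

# driver.get() returns None, so read the rendered HTML from driver.page_source
driver.get('https://www.transfermarkt.co.uk/ederson/profil/spieler/238223')
df = pd.read_html(StringIO(driver.page_source))[0]
driver.quit()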

Output:

print (df.to_string())
                                   0                          1
0                         Full name:  Ederson Santana de Moraes
1                     Date of birth:               Aug 17, 1993
2                    Place of birth:                Osasco (SP)
3                               Age:                         26
4                            Height:                     1,88 m
5                       Citizenship:            Brazil Portugal
6                          Position:                 Goalkeeper
7                              Foot:                       left
8                      Player agent:                  Gestifute
9                      Current club:            Manchester City
10                           Joined:                Jul 1, 2017
11                 Contract expires:                 30.06.2025
12  Date of last contract extension:               May 13, 2018
13                        Outfitter:                       Nike
14                     Social media:                        NaN
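
Since the question asks for several fields from the same place, the label/value rows printed above may be easier to use as a mapping from label to value. The following is only a minimal sketch: the `rows` list is a hand-copied sample of that output, and `rows_to_record` is a hypothetical helper name, not part of pandas or Selenium.

```python
# Hypothetical sample: (label, value) pairs copied from the output above.
rows = [
    ("Full name:", "Ederson Santana de Moraes"),
    ("Date of birth:", "Aug 17, 1993"),
    ("Height:", "1,88 m"),
    ("Position:", "Goalkeeper"),
    ("Social media:", None),
]


def rows_to_record(pairs):
    """Map each label (without its trailing colon) to its value."""
    record = {}
    for label, value in pairs:
        key = label.strip().rstrip(":")
        record[key] = value
    return record


player = rows_to_record(rows)
print(player["Full name"])
print(player["Position"])
```

With the real DataFrame, and assuming the table has exactly two columns, the pairs could come from `df.itertuples(index=False, name=None)` instead of the hand-written list. Missing values would then be NaN rather than None.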

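Answer 1: (score: 0)

I don't know whether this is a real solution for your case, but maybe you could use the element's XPath instead of its class. An XPath is the path through the HTML to a specific element. So, if the player's name sits at the same position in the HTML on every page, you could scrape that element every time.

To find the XPath in Firefox, locate the element in Inspector mode, then right-click it -> Copy -> XPath.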
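
As a rough illustration of selecting by path rather than by class, here is a sketch that uses the standard library's `xml.etree.ElementTree`. It rests on several assumptions: `SAMPLE_PAGE` is made-up, simplified markup (not the real transfermarkt HTML), ElementTree only accepts well-formed markup, and it supports only a subset of XPath. For real pages, a library such as lxml is the more usual choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real player page is structured differently.
SAMPLE_PAGE = """
<html>
  <body>
    <table>
      <tr><th>Full name:</th><td>Ederson Santana de Moraes</td></tr>
      <tr><th>Position:</th><td>Goalkeeper</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Positional path, similar in spirit to what the browser's "Copy -> XPath" gives:
# the cell in the first row of the first table.
by_position = root.find("./body/table[1]/tr[1]/td")

# Label-based path: the cell in whichever row has a <th> reading "Full name:".
by_label = root.find(".//tr[th='Full name:']/td")

print(by_position.text)
print(by_label.text)
```

The positional path breaks as soon as the page layout shifts, whereas the label-based one keeps working as long as the label text stays the same, so the second form is probably the safer one when no unique class is available.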