Web Scraping with BeautifulSoup: Reading a Table

Posted: 2019-09-26 18:49:00

Tags: web-scraping beautifulsoup

I am trying to get the data from the table on transfermarkt.com. With the code below I am able to get the first 25 entries. However, I need to get the remaining entries from the following pages, and when I click on the second page, the URL does not change.

I tried increasing the range of the for loop, but it throws an error. Any suggestions would be appreciated.

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop'
heads = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}

r = requests.get(url, headers = heads)
source = r.text
soup = BeautifulSoup(source, "html.parser")
players = soup.find_all("a",{"class":"spielprofil_tooltip"})
values = soup.find_all("td",{"class":"rechts hauptlink"})

playerslist = []
valueslist = []

for i in range(0,25):
    playerslist.append(players[i].text)
    valueslist.append(values[i].text)
df = pd.DataFrame({"Players":playerslist, "Values":valueslist})
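
For context, the error when widening the range is presumably an IndexError: only the first page's rows are in the fetched HTML, so the element lists never grow past what that one page contains. A minimal check, using the variables above:

    # Only page 1's markup was fetched, so the lists are capped at what that
    # page contains; indexing past their length raises IndexError.
    print(len(players), len(values))
    # e.g. values[40]  ->  IndexError if the list holds fewer than 41 items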

1 Answer:

Answer 0 (score: 0)

Change the URL inside the loop and change the selectors. The table is paginated via AJAX (which is why the address-bar URL doesn't change when you click page 2), so you can request the paginated endpoint, with a page parameter, directly:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

players = []
values = []
headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as s:
    # Request each page of the AJAX-paginated table directly.
    for page in range(1, 21):
        r = s.get(f'https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={page}', headers=headers)
        soup = bs(r.content, 'lxml')
        # Scope the selectors to the results table (.items) so only table rows match.
        players += [i.text for i in soup.select('.items .spielprofil_tooltip')]
        values += [i.text for i in soup.select('.items .rechts.hauptlink')]

df = pd.DataFrame({"Players": players, "Values": values})
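
Using a Session here also reuses the underlying connection across the 20 requests, which is a bit faster than calling requests.get each time. If you then want the market values as numbers rather than strings like "€180.00m", a small post-processing step can help. This is only a sketch: it assumes transfermarkt's "€…m"/"€…k" formatting, and the parse_value helper is hypothetical, not part of the answer above:

    # Hypothetical helper: convert value strings such as "€180.00m" or "€900k"
    # into floats in millions of euros (assumes this suffix format).
    def parse_value(v: str) -> float:
        v = v.strip().lstrip('€')
        if v.endswith('m'):
            return float(v[:-1])
        if v.endswith('k'):
            return float(v[:-1]) / 1000
        return float(v)

    df['Values_m'] = df['Values'].map(parse_value)
    print(df.head())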