Web scraping - the website shows different content than my scraper

Asked: 2020-01-13 17:46:51

Tags: python web-scraping beautifulsoup

I'm working on a project for my university that pulls data about a team and runs some statistics and other processing on it. The website I take the data from is this one: http://www.acb.com/club/estadisticas/id/13

I want to get the data for different seasons, but when I run my code the content I receive is not what the website shows. For example, for the 2014 season stats:

import requests
from bs4 import BeautifulSoup

def scrap_web(page):
    # Download the page and parse it
    pageTree = requests.get(page)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    # Locate the team totals row inside the squad statistics table
    TeamPage = pageSoup.find('div', {"class": 'estadisticas_plantilla'}).find('tr', {"class": 'totales'})
    # Split the row text into the individual values
    ValuesList = TeamPage.text.split('\n')[2:-1]
    return list(ValuesList)


urltest = "http://www.acb.com/club/estadisticas/id/13/temporada_id/2014"

print(scrap_web(urltest))

The data I receive is for the current season, not the 2014 season. Could the problem be that the content is injected into the page via JavaScript?
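One quick way to test that hypothesis is to check whether the requested season is reflected in the raw HTML at all. The strings searched for below are assumptions about what the markup contains, so treat this only as a diagnostic sketch:

import requests

# Diagnostic sketch: if these markers are missing from the static HTML, the
# stats for the requested season are not simply sitting in the page source.
url = "http://www.acb.com/club/estadisticas/id/13/temporada_id/2014"
html = requests.get(url).text
print("temporada_id/2014" in html)        # is the requested season referenced?
print("estadisticas_plantilla" in html)   # is the stats table markup present?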

2 answers:

Answer 0 (score: 1)

This is a bit different from your code, but it should get you close enough to what you need, and you can take it from there:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "http://www.acb.com/club/estadisticas/id/13/temporada_id/2014"
resp = requests.get(url)

soup = bs(resp.text, 'lxml')
table = soup.find_all('table')[0]
# The row right below the 'cabecera_general' header holds the column names
lower = table.select_one('tr.cabecera_general').findNextSibling()
table_rows = table.find_all('tr')
columns = []
rows = []

for c in lower.find_all('th'):
    columns.append(c.text)
for tr in table_rows:
    cells = tr.find_all('td')
    rows.append([cell.text for cell in cells])

games = pd.DataFrame(rows, columns=columns)
games
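Continuing from the snippet above (a sketch that has not been tested against the live site): the header rows contain no td cells, so they come through as empty rows. If you only need the team totals the question was extracting, you could drop those and keep the 'Totales' row, assuming that label sits in the first column as it does in the question's output:

games = games.dropna(how='all')                  # drop the empty header rows
totales = games[games[columns[0]] == 'Totales']  # keep only the team totals row
print(totales)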

Answer 1 (score: 1)

The catch is that the PHPSESSID the site assigns doesn't change between seasons, so the wrong season's data comes back, and even in a browser it can return the wrong data. Requesting http://www.acb.com/club/estadisticas/id/13 first, before the actual season pages, makes it work.

After some experimenting:

import random
import requests
from bs4 import BeautifulSoup

user_agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.6.01001)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.7.01001)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.5.01003)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8",
    "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)",
    "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.01",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0.1) Gecko/20100101 Firefox/5.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
    "Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.0) Opera 7.02 Bork-edition [en]",
]
cookies = {
    'acepta_uso_cookies': '1',
}
headers = {
    'acepta_uso_cookies': "1",
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'DNT': '1',
    'User-Agent': random.sample(user_agents, 1)[0],
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}

with requests.Session() as s:
    s.headers = headers
    # Hit the club page first so the server assigns the session (PHPSESSID)
    s.get('http://www.acb.com/club/estadisticas/id/13', cookies=cookies)

    for year in range(2019, 2012, -1):
        # Request each season within the same session
        response = s.get(f'http://www.acb.com/club/estadisticas/id/13/temporada_id/{year}', cookies=cookies)
        # Referer header used for the subsequent requests
        s.headers['Referer'] = f'http://www.acb.com/club/estadisticas/id/13/temporada_id/{year + 1}'
        # Same extraction as in the question: the team totals row
        pageSoup = BeautifulSoup(response.content, 'html.parser')
        TeamPage = pageSoup.find('div', {"class": 'estadisticas_plantilla'}).find('tr', {"class": 'totales'})
        ValuesList = TeamPage.text.split('\n')[2:-1]
        print(year, list(ValuesList))

2019 ['Totales', '18', '\xa0', '\xa0', '85,6', '9,8', '26,3', '37,1%', '21,0', '37,9', '55,4%', '14,2', '17,6', '80,7%', '25,6', '11,6', '37,2', '16,3', '6,8', '12,0', '1,9', '2,4', '3,2', '17,7', '20,4', '\xa0', '99,3']
2018 ['Totales', '40', '\xa0', '\xa0', '81,1', '10,2', '27,0', '37,8%', '19,0', '33,6', '56,5%', '12,5', '15,5', '80,7%', '23,4', '10,2', '33,6', '17,6', '5,4', '12,0', '1,6', '2,3', '3,2', '18,0', '19,4', '\xa0', '92,1']
2017 ['Totales', '37', '\xa0', '\xa0', '82,1', '10,8', '26,0', '41,4%', '17,5', '33,5', '52,1%', '14,9', '18,9', '78,8%', '24,2', '9,5', '33,7', '16,7', '6,1', '11,5', '1,6', '2,3', '2,3', '18,4', '20,8', '\xa0', '93,5']
2016 ['Totales', '43', '\xa0', '\xa0', '81,8', '8,7', '23,3', '37,3%', '20,1', '36,7', '54,7%', '15,5', '20,0', '77,7%', '24,4', '9,7', '34,1', '17,6', '7,0', '12,7', '1,7', '2,2', '1,8', '19,0', '21,9', '\xa0', '94,4']
2015 ['Totales', '40', '\xa0', '\xa0', '82,8', '9,0', '23,2', '38,9%', '20,3', '37,1', '54,7%', '15,1', '18,8', '80,4%', '24,5', '9,6', '34,0', '17,5', '7,2', '11,9', '2,5', '2,8', '1,9', '20,1', '22,1', '\xa0', '96,6']
2014 ['Totales', '41', '\xa0', '\xa0', '83,0', '9,2', '23,9', '38,4%', '20,1', '36,7', '54,8%', '15,3', '19,3', '79,1%', '22,0', '9,9', '31,9', '16,7', '7,8', '12,7', '2,2', '1,8', '1,5', '23,3', '22,4', '\xa0', '90,7']
2013 ['Totales', '42', '\xa0', '\xa0', '84,0', '9,5', '24,3', '38,9%', '20,0', '38,0', '52,6%', '15,6', '18,9', '82,8%', '23,9', '9,3', '33,3', '17,8', '9,0', '11,6', '2,3', '2,3', '1,9', '20,7', '21,3', '\xa0', '96,9']
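
If the goal is to run statistics over these seasons, a possible next step (just a sketch, assuming you store totals[year] = ValuesList inside the loop above instead of printing) is to put the rows into a pandas DataFrame and convert the Spanish comma decimals to numbers:

import pandas as pd

# Hypothetical aggregation step: `totals` would be filled inside the loop above
# with totals[year] = ValuesList; a truncated example row is used here so the
# snippet runs on its own.
totals = {2014: ['Totales', '41', '\xa0', '\xa0', '83,0', '38,4%']}
df = pd.DataFrame.from_dict(totals, orient='index')
# Convert comma decimals ("83,0") and percentages ("38,4%") to floats;
# non-numeric cells such as 'Totales' and '\xa0' become NaN.
df = df.apply(lambda col: pd.to_numeric(
    col.astype(str).str.rstrip('%').str.replace(',', '.', regex=False),
    errors='coerce'))
print(df)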