Question

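I am trying to scrape this page: http://www.scoresway.com/?sport=basketball&page=match&id=45926

but I can't get some of the data.

The second table on the page contains the home team's box score. The box score is split between 'basic' and 'advanced' stats. This code prints the 'basic' stats for the home team.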

from bs4 import BeautifulSoup
import requests

gameId = 45926
url = 'http://www.scoresway.com/?sport=basketball&page=match&id=' + str(gameId)
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Last row of the second table (the home team's box score)
for x in soup.find_all('table')[1].find_all('tr')[-1].find_all('td'):
    print(x.get_text())

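To see the 'advanced' stats, you click the 'advanced' link and the page shows them in place, without navigating away. I would like to scrape that information too, but I don't know how.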

Answer 1

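The advanced tab is loaded by a separate request. Simulate that request and parse the response with BeautifulSoup.

For example, here is code that prints all the players in the table: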

import requests
from bs4 import BeautifulSoup


ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id=45926&sport=basketball&localization_id=www"

response = requests.get(ADVANCED_URL)
soup = BeautifulSoup(response.text, 'html.parser')
# Player names are in the <td class="name"> cells
print([td.text.strip() for td in soup('td', class_='name')])

Prints:

['T. Chandler  *', 
 'K. Durant  *', 
 'L. James  *',
 'R. Westbrook',
 ...
 'C. Anthony']
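To see what `soup('td', class_='name')` selects without hitting the site, the same expression can be run against a small inline fragment. The markup below is made up for illustration; it only assumes, as the code above does, that player names sit in `td` cells with class `name`. Calling the soup object directly is shorthand for `find_all`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; the real response markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="name"> T. Chandler  * </td><td class="other">ignored</td></tr>
  <tr><td class="name"> R. Westbrook </td><td class="other">ignored</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup(...) is equivalent to soup.find_all(...); only class="name" cells match.
names = [td.text.strip() for td in soup("td", class_="name")]
print(names)
```

Only the outer whitespace is stripped, so the spacing inside a name is left untouched.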

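If you look at ADVANCED_URL, you'll see that the only "dynamic" parts of the URL's GET parameters are match_id and sport. If you need the code to be reusable and to work for other pages like this on the site, you'll need to fill in match_id and sport dynamically. Example implementation: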

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.scoresway.com/?sport={sport}&page=match&id={match_id}'
ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id={match_id}&sport={sport}&localization_id=www"


def get_match(sport, match_id):
    # basic: last row of the second table on the match page
    r = requests.get(BASE_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(r.content, 'html.parser')

    for x in soup.find_all('table')[1].find_all('tr')[-1].find_all('td'):
        print(x.get_text())

    # advanced: served by a separate request
    response = requests.get(ADVANCED_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(response.text, 'html.parser')
    print([td.text.strip() for td in soup('td', class_='name')])


get_match('basketball', 45926)
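
As a variation, the query string could be assembled by the standard library instead of being spelled out in a format string. This is a minimal sketch, assuming the endpoint and parameter names are exactly those in ADVANCED_URL; `ADVANCED_ENDPOINT` and `build_advanced_url` are hypothetical names introduced here, not part of the site or of any library.

```python
from urllib.parse import urlencode

# Endpoint path copied from ADVANCED_URL; the constant name is hypothetical.
ADVANCED_ENDPOINT = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced"


def build_advanced_url(sport, match_id):
    """Hypothetical helper: build the advanced-stats URL for one match."""
    params = {
        "has_wrapper": "true",
        "match_id": match_id,      # dynamic
        "sport": sport,            # dynamic
        "localization_id": "www",
    }
    # urlencode keeps the dict's insertion order and escapes the values.
    return ADVANCED_ENDPOINT + "?" + urlencode(params)


url = build_advanced_url("basketball", 45926)
print(url)
```

The same dictionary can also be handed to `requests.get` through its `params` argument, which does the encoding itself.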
