Why are some tables commented out when scraping data from Basketball Reference?

Time: 2018-04-11 03:41:35

Tags: python web-scraping beautifulsoup

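I am trying to scrape all of a player's data from Basketball Reference using BeautifulSoup. Take Michael Jordan as an example (https://www.basketball-reference.com/players/j/jordami01.html). The problem is that when I fetch the HTML page and parse it, I can only grab one data table, while the others appear to be commented out. I am new to Python and hope someone can tell me why the HTML contains some of its data tables as comments. Can someone help me with this?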

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import pandas as pd

MJ_url = 'https://www.basketball-reference.com/players/j/jordami01.html'

uClient = uReq(MJ_url)

MJ_html = uClient.read()

uClient.close()

MJ_soup = soup(MJ_html, "html.parser")

MJ_containers = MJ_soup.find_all("table", {"class": "row_summable sortable stats_table"})

1 answer:

Answer 0 (score: 1)

Try this. It pulls out all the data that sits inside the comments:

import requests
from bs4 import BeautifulSoup, Comment

res = requests.get("https://www.basketball-reference.com/players/j/jordami01.html",headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text, 'lxml')
# Collect every HTML comment node in the page
for comment in soup.find_all(string=lambda text:isinstance(text,Comment)):
    # Parse the comment text as HTML so the tables inside it become searchable
    data = BeautifulSoup(comment,"lxml")
    for items in data.select("table.row_summable tr"):
        # One list of cell texts per table row
        tds = [item.get_text(strip=True) for item in items.select("th,td")]
        print(tds)
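
To see why this works without hitting the network, here is a minimal sketch of the same idea run on a small inline page. The sample HTML, its table ids, the placeholder values, and the helper name commented_tables are all made up for illustration and are not taken from the real site. It uses the built-in "html.parser" so that lxml is not required. A plain find_all("table") only sees the visible table, because the parser treats the commented-out markup as a single Comment string rather than as tags.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page: one visible table and one table wrapped in an
# HTML comment. The values are placeholders, not real statistics.
SAMPLE_HTML = """
<div id="all_per_game">
  <table class="row_summable" id="per_game">
    <tr><th>Season</th><th>PTS</th></tr>
    <tr><th>Season A</th><td>1.5</td></tr>
  </table>
</div>
<div id="all_totals">
<!--
  <table class="row_summable" id="totals">
    <tr><th>Season</th><th>PTS</th></tr>
    <tr><th>Season A</th><td>10</td></tr>
  </table>
-->
</div>
"""


def commented_tables(html):
    """Return {table id: list of rows} for tables found inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    tables = {}
    # Comment nodes are strings, so filter the page's strings by type.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Re-parse the comment body so its markup becomes real tags.
        inner = BeautifulSoup(comment, "html.parser")
        for table in inner.find_all("table"):
            rows = [
                [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")
            ]
            tables[table.get("id")] = rows
    return tables


# Only the uncommented table is visible to a normal search.
visible_count = len(BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("table"))
result = commented_tables(SAMPLE_HTML)
print(visible_count)
print(result)
```

Keying the result by table id is just one possible design choice. It makes it easy to pick out a single table later, assuming every commented table carries an id attribute.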