Question

I am trying to extract data from the Four Factors table on this page: https://www.basketball-reference.com/boxscores/201101100CHA.html. I am having trouble getting to the table. I have tried

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

div = soup.find('div',id='all_four_factors')

Then, when I try to pull the rows with tr = div.find_all('tr'), I get nothing back.

Answer 1

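I looked at the HTML you are trying to scrape, and the problem is that the tags you want are all inside a comment. BeautifulSoup treats the contents of a comment as plain text, not as actual HTML. So what you need to do is grab the contents of the comment and feed that string back into BeautifulSoup: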

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

div = soup.find('div', id='all_four_factors')

# Get everything in here that's a comment
comments = div.find_all(string=lambda text: isinstance(text, Comment))

# Loop through each comment until you find the one that
# has the stuff you want.
for c in comments:

    # A perhaps crude but effective way of stopping at a comment
    # with HTML inside: see if the first character inside is '<'.
    if c.strip().startswith('<'):
        newsoup = BeautifulSoup(c.strip(), 'html.parser')
        tr = newsoup.find_all('tr')
        print(tr)

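One caveat: BS will assume that the commented-out code is valid, well-formed HTML. This works for me, though, so as long as the page stays roughly the same, it should keep working.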
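To go from the tr tags to actual cell values, the same idea can be wrapped in a small helper. This is only a sketch: SAMPLE_HTML is a made-up miniature of the page structure with placeholder values (not real stats), and rows_from_commented_table is a hypothetical name, not something bs4 provides.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the real page: the table lives inside an HTML comment.
# Team names and numbers are placeholders, not real data.
SAMPLE_HTML = """
<div id="all_four_factors" class="table_wrapper setup_commented commented">
  <div class="section_heading"><h2>Four Factors</h2></div>
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>90.1</td></tr>
    <tr><td>BBB</td><td>90.1</td></tr>
  </table>
  -->
</div>
"""


def rows_from_commented_table(html, wrapper_id):
    """Return the cell text of the first table found inside a comment
    within the div that has the given id (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", id=wrapper_id)
    if wrapper is None:
        return []
    for comment in wrapper.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text as HTML in its own soup.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table")
        if table is None:
            continue  # this comment did not contain a table
        return [
            [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")
        ]
    return []


rows = rows_from_commented_table(SAMPLE_HTML, "all_four_factors")
for row in rows:
    print(row)
```

Against the live site you would pass requests.get(url).content instead of SAMPLE_HTML; whether the real table has extra header rows you need to skip is something to check on the page itself.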

Answer 2

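If you look at list(div.children)[5], which is the only child that contains tr as a substring, you will see that it is a Comment object, so technically there is no tr element under the div node. Therefore div.find_all('tr') is expected to be empty.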
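You can check this without hitting the site. The snippet below is a sketch on a made-up fragment (SAMPLE_HTML and its values are placeholders, so the position of the comment among the children will differ from the real page); it only shows that the rows exist as text inside a Comment child and not as tags.

```python
from bs4 import BeautifulSoup, Comment

# Made-up miniature of the wrapper div; the table is commented out.
SAMPLE_HTML = """
<div id="all_four_factors" class="table_wrapper setup_commented commented">
  <div class="section_heading"><h2>Four Factors</h2></div>
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>90.1</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", id="all_four_factors")

# find_all only matches real tags, so commented-out rows are invisible to it.
rows_in_div = div.find_all("tr")

# The comment shows up as a direct child of the div, as a Comment object.
comment_children = [child for child in div.children if isinstance(child, Comment)]

# Comment is a string subclass, so a plain substring test works on it.
comment_has_tr = any("<tr" in comment for comment in comment_children)

print(len(rows_in_div), len(comment_children), comment_has_tr)
```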

Answer 3

Why are you doing this:

div = soup.find('div',id='all_four_factors')

This gets the following line and tries to search for 'tr' tags inside it.

<div id="all_four_factors" class="table_wrapper floated setup_commented commented">

You can just use the original soup variable from the first part and do

tr = soup.find_all('tr')

Beautiful Soup: extracting data from a table
