Question

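I'm trying to use bs4 to get text from a website, but I keep getting this error and I don't know why. The error is: TypeError: slice indices must be integers or None or have an __index__ method.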

Here is my code:

from urllib.request import urlopen
import bs4

url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)

html_bytes = page.read()
html = html_bytes.decode("utf-8")

text = html.find("div", {"class": "gc-score__title"})  # the error is in this line
print(text)

Answer 1

In this line:

text = html.find("div", {"class":"gc-score__title"})

you are calling the str.find method, not the bs4.BeautifulSoup.find method, because html is a plain string.
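The wording of the error makes more sense once you recall that str.find takes optional start and end positions after the substring, so the dictionary ends up being treated as a start index. A minimal sketch of that, where sample_html is a made-up one-line stand-in for the decoded page rather than the real site:

```python
# sample_html is a made-up stand-in for the decoded page, not the real site.
sample_html = '<div class="gc-score__title">Dallas at Pittsburgh</div>'

# str.find(sub, start, end): the second argument is expected to be an index.
position = sample_html.find("div", 0)

try:
    # The dict lands in the "start" slot, which triggers the TypeError.
    sample_html.find("div", {"class": "gc-score__title"})
except TypeError as exc:
    error_message = str(exc)

print(position)
print(error_message)
```

The first call works because 0 is a valid start index; the second fails before any searching happens.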

So if you do this instead:

soup = bs4.BeautifulSoup(html, 'html.parser')
text = soup.find("div", {"class":"gc-score__title"})
print(text)

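the error will go away. That said, the site builds its content with JavaScript, so this will not produce the result you expect. You will need a tool such as Selenium to scrape this site.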

Answer 2

First, if you want BeautifulSoup to parse the data, you need to ask it to.

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
soup = BeautifulSoup(html_bytes, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning

Then you can use soup.find to look for the <div> tag:

text = soup.find("div", {"class":"gc-score__title"})

This gets rid of the error. You were calling str.find because html is a string; to pick out tags, you need to call the find method of a bs4.BeautifulSoup object.

But apart from removing the error, that line will not do what you want. It will return None, because the data at that URL does not contain the tag <div class="gc-score__title">.
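Since find gives back None when nothing matches, it can help to check the result before reading text from it. A minimal sketch, assuming a small inline HTML string; sample_html, the class name no-such-class, and the helper tag_text are all made up for illustration:

```python
from bs4 import BeautifulSoup

# sample_html is made-up markup, not the content of the real page.
sample_html = '<div class="gc-score__title">Dallas at Pittsburgh</div>'
soup = BeautifulSoup(sample_html, "html.parser")

found = soup.find("div", {"class": "gc-score__title"})
missing = soup.find("div", {"class": "no-such-class"})  # no match, so None


def tag_text(tag):
    """Return the tag's text, or None when find() matched nothing."""
    return tag.get_text(strip=True) if tag is not None else None


print(tag_text(found))
print(tag_text(missing))
```

Without the None check, reading an attribute such as .text from the missing result would raise an AttributeError.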

Copy the contents of html_bytes into a text editor to confirm this.
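The same check can also be done in code rather than in a text editor. A rough sketch, assuming the page is UTF-8 as in the question; contains_marker is a hypothetical helper, and sample_bytes is made-up data standing in for html_bytes:

```python
def contains_marker(raw_bytes, marker, encoding="utf-8"):
    """Report whether marker appears anywhere in the raw, unparsed HTML."""
    # errors="replace" keeps a stray bad byte from aborting the check.
    return marker in raw_bytes.decode(encoding, errors="replace")


# sample_bytes is made up; with the real page you would pass html_bytes.
sample_bytes = b'<html><body><div id="app"></div></body></html>'

print(contains_marker(sample_bytes, "gc-score__title"))
print(contains_marker(sample_bytes, 'id="app"'))
```

With the code from this answer, contains_marker(html_bytes, "gc-score__title") returning False would mean the class name is not in the HTML the server sent, so no parser could find it there.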

Python web scraping - why am I getting this error?
