Question

I'm trying to parse this web page.

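I've already got the list of fixtures (and nothing else) with table = soup.find("div",{"class","accordions"}), but now I'm trying to iterate over each match one at a time. It looks like every match starts with an article element tag such as <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">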

But for some reason, when I use matches = table.findAll("article",{"role","article"}) and then print the length of matches, I get 0.

I've also tried matches = table.findAll("article",{"about","/fixture/arsenal"}), but I run into the same problem.

Is BeautifulSoup unable to parse these tags, or am I just using it incorrectly?

Answer 1

Try this:

matches = table.findAll('article', attrs={'role': 'article'})
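A likely reason the original call found nothing: {"role","article"} uses a comma, so Python builds a set, not a dictionary, and no attribute name/value pair is ever passed. A minimal plain-Python sketch of the difference (the variable names wrong and right are only for illustration):

```python
# A comma between the strings makes a set literal: two unrelated values.
wrong = {"role", "article"}

# A colon makes a dict literal: attribute name mapped to attribute value.
right = {"role": "article"}

print(type(wrong).__name__)  # set
print(type(right).__name__)  # dict
print(right["role"])         # article
```

As far as I understand, Beautiful Soup treats a non-dictionary attrs value as a filter on the class attribute, which would explain why the earlier div lookup with {"class","accordions"} happened to work. Treat that as an assumption and check the docs for your installed version.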

Answer 2

The reason is that findAll is searching by tag name. See the bs4 docs.

Answer 3

You need to pass the attributes as a dictionary. There are three ways to get the data you want.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')

matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16

Or, equivalently:

matches = soup.find_all('article', role='article')

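However, both of these approaches also return some extra article tags that are not Arsenal fixtures. So, if you want to find the fixtures by their /fixture/arsenal prefix, you can use CSS selectors. (Passing a plain string to find_all will not work here, because a string has to match the whole attribute value and you need a partial match.)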

matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
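To try the prefix match without hitting the live site, here is a small sketch on made-up markup. Only the first about value comes from the question; the other two are placeholders. It also shows that find_all can match partially if it is given a compiled regular expression instead of a plain string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="accordions">
  <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city"></article>
  <article role="article" about="/fixture/arsenal/example-date/example-opponent"></article>
  <article role="article" about="/news/example-story"></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: ^= means "value starts with".
by_css = soup.select('article[about^="/fixture/arsenal"]')

# Alternative: a compiled regex as the attribute filter in find_all.
by_regex = soup.find_all("article", about=re.compile(r"^/fixture/arsenal"))

print(len(by_css), len(by_regex))
for match in by_css:
    print(match["about"])
```

Quoting the attribute value in the selector keeps it valid CSS, which matters for values containing slashes.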

Also, take a look at the keyword arguments. They will help you get what you want.
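As a rough illustration of keyword arguments, the sketch below runs the dictionary form and the keyword form on a small made-up snippet (the markup is a placeholder, not the real page) and gets the same tags from both. Note that class is a reserved word in Python, so Beautiful Soup spells that keyword class_.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="accordions">
  <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city"></article>
  <article role="article" about="/news/example-story"></article>
</div>
<div class="other"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form of the class lookup from the question.
table = soup.find("div", class_="accordions")

# Dictionary form and keyword form of the same attribute filter.
by_dict = table.find_all("article", attrs={"role": "article"})
by_keyword = table.find_all("article", role="article")

print(len(by_dict), len(by_keyword))
```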
