Question

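I'm trying to scrape stats from Bulbapedia.

I want to get this table from each page.

The table is not at a fixed position in the page, and sometimes there is more than one.

I want my script to look for the table in the page and, if it finds one, return that element and ignore the others.

Here are some pages where the table appears in different positions: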

page 1

page 2

page 3

I just want to select the table element and then extract the data I need.

Answer 1

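When working with wiki pages that have no specific IDs or classes, what you really want to do is find some distinguishing feature that sets the object of interest apart from everything else.

In your case, if we analyze all three pages, the "stats table" always contains an a tag whose href is always /wiki/Statistic.

So, to find this particular table, you have two options:

Find every table that contains an a tag whose href equals /wiki/Statistic
Find every link whose href equals /wiki/Statistic and take its parent table tag

Here is a code example: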

from bs4 import BeautifulSoup
import requests

pages = [
    'https://bulbapedia.bulbagarden.net/wiki/Charmander_(Pokémon)',
    'https://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pokémon)',
    'https://bulbapedia.bulbagarden.net/wiki/Eternatus_(Pokémon)'
]

for page in pages:
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Option 1: every table that contains a link to /wiki/Statistic
    stat_tables = [table for table in soup.find_all('table') if table.find('a', href='/wiki/Statistic') is not None]
    # OR
    # Option 2: the enclosing table of every link to /wiki/Statistic
    stat_tables = [a.find_parent('table') for a in soup.find_all('a', href='/wiki/Statistic')]

    for table in stat_tables:
        # Parse table
        pass

Since you said you only want to extract the table, I'll leave the parsing part to you :) But if you have any questions, feel free to ask.
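
If it helps as a starting point, here is a minimal sketch of picking only the first matching table and turning its rows into a dictionary. It runs on a small hand-written HTML fragment rather than a live page, so SAMPLE_HTML, first_stat_table and parse_stat_rows are all hypothetical names, and the th/td row layout is an assumption: the real Bulbapedia markup is more involved and will probably need extra handling.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption, not the real Bulbapedia HTML).
SAMPLE_HTML = """
<table><tr><td><a href="/wiki/Type">Type</a></td></tr></table>
<table>
  <tr><th><a href="/wiki/Statistic">Stat</a></th><th>Value</th></tr>
  <tr><th>HP</th><td>39</td></tr>
  <tr><th>Attack</th><td>52</td></tr>
</table>
<table>
  <tr><th><a href="/wiki/Statistic">Stat</a></th><th>Value</th></tr>
  <tr><th>HP</th><td>45</td></tr>
</table>
"""


def first_stat_table(soup):
    """Return the first table enclosing a link to /wiki/Statistic, or None."""
    link = soup.find('a', href='/wiki/Statistic')
    return link.find_parent('table') if link is not None else None


def parse_stat_rows(table):
    """Map each row's header cell text to the text of its first data cell."""
    stats = {}
    for row in table.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        # Rows without both a header and a data cell (e.g. the title row) are skipped.
        if header is not None and value is not None:
            stats[header.get_text(strip=True)] = value.get_text(strip=True)
    return stats


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = first_stat_table(soup)
stats = parse_stat_rows(table) if table is not None else {}
print(stats)
```

On this fragment the second stats table is ignored, which matches the "return the first one and ignore the others" requirement. Values come back as strings, so convert them with int() if you need numbers.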
