Question

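I want to extract the text from the th tags of a table so that I can print the list of Underground stations from a table on a Wikipedia page. I only need the text from one particular table (there are two on the page).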

import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_London_Underground_stations"
page = urllib.request.urlopen(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")

stations_table = soup.find("table", class_= "wikitable sortable plainrowheaders")
stations_table

for i in soup.find_all('th', stations_table):
    print(i.text)

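I can get the table and store it in the stations_table variable, but I cannot print only the station names from the th tags inside the wikitable sortable plainrowheaders table. The loop does print the station names, but it also prints the column headers:

Station Local authority Zone(s)[†] Opened[4] Main line opened Usage[5]

How can I filter these out?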

Answer 1

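It shows every th in the table: not only the station names but also column headers such as Stations and Lines.

To skip them, I would find all the tr rows, skip the first row, and then search for the th in each remaining row:

for i in stations_table.find_all('tr')[1:]:
    print(i.find('th').text.strip())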

Full code:

import urllib.request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_London_Underground_stations"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html.parser")

stations_table = soup.find("table", class_= "wikitable sortable plainrowheaders")

for i in stations_table.find_all('tr')[1:]:
    print(i.find('th').text.strip())
    #print(i.th.text.strip())
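
The same row-skipping idea can be tried offline. The following is a minimal sketch, not part of the original answer: `SAMPLE_HTML` is a made-up miniature of the Wikipedia table and `station_names` is a hypothetical helper name. It also guards against rows that contain no th at all, where `row.find('th')` returns `None` and `.text` would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
SAMPLE_HTML = """
<table class="wikitable sortable plainrowheaders">
  <tr><th>Station</th><th>Line(s)</th></tr>
  <tr><th scope="row"><a href="#">Acton Town</a></th><td>District</td></tr>
  <tr><td colspan="2">row without a header cell</td></tr>
  <tr><th scope="row"><a href="#">Aldgate</a></th><td>Circle</td></tr>
</table>
"""


def station_names(table):
    """Return the text of the first <th> in every row after the header row."""
    names = []
    for row in table.find_all("tr")[1:]:  # [1:] skips the column-header row
        header_cell = row.find("th")
        if header_cell is None:  # row has no <th>; nothing to collect
            continue
        names.append(header_cell.get_text(strip=True))
    return names


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable sortable plainrowheaders")
print(station_names(table))
```

Whether the real page contains rows without a th is not something the thread states, so the `None` check is only a precaution.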

Answer 2

for i in soup.find_all('th', stations_table):

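matches every th it finds, the column headers in the first row as well as the row headers in the station rows. Instead, you can extract all the rows and print from the second row onward (ignoring the header row), like this: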

for i in stations_table.find_all('tr')[1:]:
    print(i.find('th').text)

Extracting text from selected tags using Beautiful Soup
