Question

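I am trying to scrape this URL so that I only get certain table indexes. The example below scrapes index 6, which gives me every URL that starts with /wiki/. in that table, i.e. all the TLDs that begin with A. I want to get all the indexes that are relevant to the task.

So far I have tried passing them as a list like [6, 7, 8, etc.], and also wrapping them in quotes. I haven't worked with lists much, though, and I need to spend more time learning them.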

import requests
from bs4 import BeautifulSoup 

page = requests.get('https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains')
soup = BeautifulSoup(page.text, 'lxml')

table = soup.findAll('table')[6]
for record in table.findAll('tr'):
    for data in record.findAll('td'):
        for link in data.select("a[href^='/wiki/.']"):
            links = link.contents[0]
            print(links)

However, since I am new to programming, I am not sure how to add more indexes besides 6. These are the errors I get:

======= RESTART: /run/media/sean/The Continuum/Python/wikinamelist.py =======
Traceback (most recent call last):
  File "/run/media/sean/The Continuum/Python/wikinamelist.py", line 7, in <module>
    table_data = soup.find_all('table')["6", "7"]
TypeError: list indices must be integers or slices, not tuple
>>> 
======= RESTART: /run/media/sean/The Continuum/Python/wikinamelist.py =======
Traceback (most recent call last):
  File "/run/media/sean/The Continuum/Python/wikinamelist.py", line 7, in <module>
    table_data = soup.find_all('table')[6, 7];
TypeError: list indices must be integers or slices, not tuple
>>> 
======= RESTART: /run/media/sean/The Continuum/Python/wikinamelist.py =======
Traceback (most recent call last):
  File "/run/media/sean/The Continuum/Python/wikinamelist.py", line 7, in <module>
    table_data = soup.find_all('table')[6, 7, 8];
TypeError: list indices must be integers or slices, not tuple

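As you can see above, I have tried several approaches; each attempt is visible in the error messages.

Any feedback would be greatly appreciated, thanks!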

Answer 1

You may be able to use a comma-separated list of nth-of-type selectors

table:nth-of-type(6), table:nth-of-type(7), table:nth-of-type(8)

So

tables = soup.select('table:nth-of-type(6), table:nth-of-type(7), table:nth-of-type(8)')

Then

for table in tables:
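The loop above could be filled in roughly like this. It is only a sketch: the inline HTML is a made-up stand-in for the Wikipedia page (so it runs without a network call), and it uses the built-in html.parser instead of lxml. Note that nth-of-type is 1-based and counts among sibling elements, so the numbers will not necessarily match the 0-based indexes you would pass to find_all('table')[...]; you may need to adjust them for the real page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page: three sibling tables.
html = """
<body>
<table><tr><td><a href="/wiki/.com">.com</a></td></tr></table>
<table><tr>
  <td><a href="/wiki/.ac">.ac</a></td>
  <td><a href="/wiki/Ascension_Island">Ascension Island</a></td>
</tr></table>
<table><tr><td><a href="/wiki/.ba">.ba</a></td></tr></table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the 2nd and 3rd sibling tables (nth-of-type is 1-based).
tables = soup.select("table:nth-of-type(2), table:nth-of-type(3)")

names = []
for table in tables:
    # Keep only links whose href starts with "/wiki/." (the TLD pages).
    for link in table.select("a[href^='/wiki/.']"):
        names.append(link.get_text())

print(names)
```
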

You can also condense this

links = [item['href'] for item in soup.select("table:nth-of-type(6) [href^='/wiki/.'], table:nth-of-type(7) [href^='/wiki/.'], table:nth-of-type(8) [href^='/wiki/.']")]

You may also be able to swap the table type selector for a class selector, e.g. .wikitable. That would be faster.
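A possible way to combine the class selector with plain Python indexing is sketched below. The HTML is an invented stand-in, and the indexes in wanted are illustrative only; on the real page you would have to check which .wikitable positions hold the tables you need.

```python
from bs4 import BeautifulSoup

# Invented stand-in: one non-wikitable table and three wikitables,
# one of them nested inside a div.
html = """
<table class="infobox"><tr><td><a href="/wiki/.info">.info</a></td></tr></table>
<table class="wikitable"><tr><td><a href="/wiki/.ac">.ac</a></td></tr></table>
<div><table class="wikitable"><tr><td><a href="/wiki/.ba">.ba</a></td></tr></table></div>
<table class="wikitable"><tr><td><a href="/wiki/.ca">.ca</a></td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list in document order, so it can be indexed like any list.
wikitables = soup.select("table.wikitable")

# Illustrative 0-based positions within the list of wikitables.
wanted = [0, 2]

links = [
    a["href"]
    for i in wanted
    for a in wikitables[i].select("[href^='/wiki/.']")
]

print(links)
```

Indexing the result list this way does not depend on how the tables are nested in the page, unlike nth-of-type.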

pd.read_html：

If pd.read_html returns the tables, you can index/slice into that list to get the tables you want.
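Indexing or slicing works the same way whether the list comes from pd.read_html (which returns a list of DataFrames) or from soup.find_all('table'). The sketch below uses a stand-in list of strings so it runs on its own; it also shows why tables[6, 7, 8] fails: the comma makes the index a tuple, and lists only accept an integer or a slice.

```python
# Stand-in for the list of tables; in real use this would be the result of
# pd.read_html(url) or soup.find_all('table').
tables = [f"table {n}" for n in range(10)]

# A comma inside the brackets builds a tuple, which a list cannot be indexed by.
try:
    tables[6, 7, 8]
except TypeError as exc:
    error_message = str(exc)

# Consecutive tables: a slice (the end index is exclusive).
by_slice = tables[6:9]

# Arbitrary positions: index one at a time in a list comprehension.
by_index = [tables[i] for i in (6, 8)]

print(error_message)
print(by_slice)
print(by_index)
```
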

Answer 2


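This is the simplest way to scrape multiple lists on a page. It targets the first column specifically and iterates over each table.

The answer above really did help with my problem! However, I made some changes to its suggestion. Instead of condensing the code as suggested, I created a list of variables that select the tables I want, and then printed the information in those variables to stdout. This code is more readable and more modular.

The variables also correspond to the names of the tables.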
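The code for this answer is not shown here, so the following is only a guess at the approach described: one variable per table, collected in a list, with the first column of each table printed. The HTML, the variable names and the indexes are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the third table is deliberately unused.
html = """
<table>
  <tr><th>Name</th><th>Entity</th></tr>
  <tr><td><a href="/wiki/.com">.com</a></td><td><a href="/wiki/Commerce">commercial</a></td></tr>
  <tr><td><a href="/wiki/.org">.org</a></td><td>organization</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Entity</th></tr>
  <tr><td><a href="/wiki/.ac">.ac</a></td><td><a href="/wiki/Ascension_Island">Ascension Island</a></td></tr>
  <tr><td><a href="/wiki/.ad">.ad</a></td><td>Andorra</td></tr>
</table>
<table>
  <tr><td><a href="/wiki/.zz">.zz</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
all_tables = soup.find_all("table")

# One variable per table, named after the table it holds (names are made up).
original_tlds = all_tables[0]
country_code_tlds = all_tables[1]
selected_tables = [original_tlds, country_code_tlds]

first_column = []
for table in selected_tables:
    for row in table.find_all("tr"):
        cell = row.find("td")  # first data cell of the row
        if cell is None:       # header rows contain only <th> cells
            continue
        link = cell.find("a", href=True)
        if link is not None and link["href"].startswith("/wiki/."):
            first_column.append(link.get_text())

print(first_column)
```
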

Thanks for your help; after your suggestions this became very simple.

How do I scrape multiple tables by index using BeautifulSoup?
