Question

In my code I am trying to split a string and put the links (in the string) into a list using the .split() method, but when I try this:

ciao = []
for article in soup.find_all("a", {"style": "height:81px;"}):
    ciao = article.get("href").split()
    print(ciao[1])

I get the error: "IndexError: list index out of range".

So I tried printing the list:

ciao = []
for article in soup.find_all("a", {"style": "height:81px;"}):
    ciao = article.get("href").split()
    print (ciao)

and it gave me:

[link1]
[link2]
[link3]
[link4]
[link5]
[link6]
...

instead of

[link1, link2, link3, ...]

Can you explain why this happens and how to correct my code so that I get a single list?

Answer 1

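You are assigning to the list and printing it on every iteration, so each iteration overwrites the previous list.

Instead, you can append each link to the list and print it once after the loop, which gives the result you want: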

ciao = []
for article in soup.find_all("a", {"style": "height:81px;"}):
    if article.has_attr("href"):  # optional, but I recommend it when scraping so that a tag without an href attribute is skipped instead of adding None to the list
        ciao.append(article.get("href"))
print(ciao)
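
To see the difference between reassigning and appending without a live page, here is a minimal sketch. It assumes a hypothetical list of href strings in place of the tags returned by soup.find_all; the URLs are invented for illustration.

```python
# Hypothetical href values standing in for the article.get("href") results.
hrefs = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

# Reassigning inside the loop: each pass replaces the previous list.
overwritten = []
for href in hrefs:
    overwritten = href.split()  # a brand-new one-element list every time

# Appending inside the loop: the list grows by one link per pass.
collected = []
for href in hrefs:
    collected.append(href)

print(overwritten)  # only the last link is left
print(collected)    # all three links
```

The first loop never accumulates anything because the name is rebound on every pass, which matches the one-element lists printed in the question.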

If you do not need the list later and only want to print the links, you can call print with end=', ' inside the for loop, e.g. print(article.get("href"), end=', ').
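
As a rough illustration of the print-only variant, the sketch below uses placeholder strings (link1, link2, link3 are not real URLs) instead of scraped hrefs. It also shows str.join as one possible alternative, since end=', ' leaves a separator after the last item.

```python
# Placeholder strings standing in for the scraped href values.
hrefs = ["link1", "link2", "link3"]

# Print every link on one line; note that a ", " also follows the last one.
for href in hrefs:
    print(href, end=", ")
print()  # finish the line

# Alternative: build the line with join, which puts separators only between items.
line = ", ".join(hrefs)
print(line)
```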

A one-liner using a list comprehension:

ciao = [article.get("href") for article in soup.find_all("a", {"style": "height:81px;"}) if article.has_attr("href")]
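
The effect of the has_attr filter can be sketched without parsing any HTML. FakeTag below is a hypothetical stand-in written for this example, not a BeautifulSoup class; it only imitates the get and has_attr calls used above, on the assumption that get returns None for a missing attribute.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed <a> tag (not part of BeautifulSoup)."""

    def __init__(self, attrs):
        self.attrs = attrs

    def get(self, key, default=None):
        # Return the attribute value, or the default when it is missing.
        return self.attrs.get(key, default)

    def has_attr(self, key):
        return key in self.attrs


tags = [
    FakeTag({"href": "link1"}),
    FakeTag({"style": "height:81px;"}),  # no href on this one
    FakeTag({"href": "link3"}),
]

# Without the filter, the missing href shows up as None in the result.
unfiltered = [tag.get("href") for tag in tags]

# With the filter, tags that lack an href are skipped.
filtered = [tag.get("href") for tag in tags if tag.has_attr("href")]

print(unfiltered)
print(filtered)
```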

Answer 2

我认为你的逻辑不清楚：

soup.find_all("a", {"style": "height:81px;"})

This retrieves a list of articles (the matching <a> tags), so

ciao = article.get("href")

returns a single link for that one article. To get a list of links there are several options; one is the well-known list comprehension:

mylist = [article.get('href') for article in soup.find_all("a", {"style": "height:81px;"})]

You may also want to get familiar with map, which is considered a bit more "complex", especially since I use a lambda expression here:

mylist = list(map(lambda article: article.get('href'), soup.find_all("a", {"style": "height:81px;"})))

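If you only need to iterate over the result, you can leave it as a map object. The logic in both solutions is that you want to transform the list returned by soup.find_all by applying get to each item.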
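
One caveat when keeping the map object is that, in Python 3, it can be consumed only once. The sketch below illustrates this with plain dicts as hypothetical stand-ins for the tags (dict.get is used here only because it is called the same way as the get in the answer).

```python
# Dicts as hypothetical stand-ins for the tags returned by soup.find_all.
articles = [{"href": "link1"}, {"href": "link2"}, {"href": "link3"}]

# List comprehension: produces a list that can be reused freely.
as_list = [article.get("href") for article in articles]

# map: produces a lazy iterator that yields each value once.
lazy = map(lambda article: article.get("href"), articles)
first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # nothing left to yield

print(as_list)
print(first_pass)
print(second_pass)
```

If the links are needed more than once, wrapping the map in list(...) as in the answer avoids the empty second pass.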

Answer 3

If what you want is to extract the hrefs from your page, this is enough:

a_nodes = soup.find_all("a", {"style": "height:81px;"})
hrefs = [a_node.get('href') for a_node in a_nodes] # and this extracts hrefs from those

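Your code is not splitting anything because you are calling split on a single URL that contains no whitespace (and I don't think splitting is what you want anyway).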
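
A small sketch of why the original print(ciao[1]) fails: with no arguments, str.split separates on whitespace, so a URL without spaces comes back as a one-element list. The URL is a made-up example.

```python
url = "https://example.com/article?id=1"  # hypothetical URL with no whitespace

parts = url.split()  # no whitespace to split on, so the whole URL is one item
print(parts)
print(len(parts))

# Index 1 does not exist in a one-element list.
error = None
try:
    parts[1]
except IndexError as exc:
    error = str(exc)
print(error)

# For contrast, a string that does contain whitespace splits into several items.
spaced = "link1 link2 link3".split()
print(spaced)
```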

Answer 4

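There are three errors here:

1. You are calling print(ciao[1]), but Python lists are indexed from 0. To get the first item in a list, call print(ciao[0]); to print the whole list, call print(ciao).
2. You are not adding to the list, you are resetting it on every iteration. To add an item, use list.append(item).
3. You (in most cases) do not want to split a link, and it is not needed in this implementation (from what I can see, anyway).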

Fixing these errors gives the new code:

ciao = []
for article in soup.find_all("a", {"style": "height:81px;"}):
    ciao.append(article.get("href"))
print(ciao)

.split() does not convert a string into a list