Question

I have some HTML laid out like this:

<div class="news-a">

 <article>
  <header>
   <h2>
     <a>destination 1</a>
   </h2>
  </header>
 </article>

 <article>
  <header>
   <h2>
     <a>destination 2</a>
   </h2>
  </header>
 </article>

 <article>
  <header>
   <h2>
     <a>destination 3</a>
   </h2>
  </header>
 </article>

</div>

I am trying to use BeautifulSoup to return all the destination names, so I target the class "news-a", because I know there is only one of them on the site. Here is my scraper code:

import requests
from bs4 import BeautifulSoup

page = requests.get('url')
soup = BeautifulSoup(page.content, 'html.parser')

destinations = soup.find(class_='news-a')

for destination in destinations.find_all('h2'):
    print(destination.text)

But when I run it against the live URL, this only returns the first result, "destination 1".


Answer 1

How about this? A more concise way to get the desired output:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.travelindicator.com/destinations?page=1').text
soup = BeautifulSoup(page,"lxml")
for item in soup.select(".news-a h2 a"):
    print(item.text)

Result:

Con Dao
Kuwait City
Funafuti
Saint Helier
Mount Kailash
Sunny Beach
Krakow
Azores
Alsace
Qaqortoq
Salt Lake City
Valkenburg
Daegu
Lviv
São Luís
Abidjan
Lampedusa
Lecce
Norfolk Island
Petra
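
The selector above can be checked offline without hitting the site. This is a minimal sketch, assuming the names are spread across several sibling "news-a" containers; the inline HTML and the names SAMPLE_HTML and names are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two sibling containers that share the class "news-a".
SAMPLE_HTML = """
<div class="news-a"><article><header><h2><a>destination 1</a></h2></header></article></div>
<div class="news-a"><article><header><h2><a>destination 2</a></h2></header></article></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns every <a> inside an <h2> inside any .news-a element,
# so it is not limited to the first matching container.
names = [a.get_text(strip=True) for a in soup.select(".news-a h2 a")]
print(names)
```

Because the CSS selector is matched against the whole document, it does not matter how many "news-a" containers the page actually has.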

Answer 2

One problem is that your simplified example is quite different from the HTML at the link you originally posted and then removed.

Try this with the travelindicator.com link:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.travelindicator.com/destinations?page=1')
soup = BeautifulSoup(page.content, 'html.parser')
locs = soup.find_all(lambda tag: tag.name == 'a'
                     # default to '' so <a> tags without an href do not raise
                     and tag.get('href', '').startswith('locations')
                     and tag.has_attr('title')
                     and not tag.has_attr('style'))

for loc in locs:
    # `locs` is a list of `a` tags, each with an href
    #     and title attribute
    if loc.get('title').startswith('Travel ideas'):
        print(loc.text)

Result:

Con Dao
Kuwait City
Funafuti
Saint Helier
Mount Kailash
Sunny Beach
Krakow
Azores
Alsace
Qaqortoq
Salt Lake City
Valkenburg
Daegu
Lviv
São Luís
Abidjan
Lampedusa
Lecce
Norfolk Island
Petra
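
The filter-function approach can also be tried on a small inline document. This is only a sketch under assumptions: the href and title values in SAMPLE_HTML are invented to resemble what the filter expects, and is_destination_link is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the href and title values are assumptions.
SAMPLE_HTML = """
<a id="top">no href here</a>
<a href="locations/con-dao" title="Travel ideas for Con Dao">Con Dao</a>
<a href="locations/krakow" title="Travel ideas for Krakow" style="color:red">Krakow (styled)</a>
<a href="about" title="About us">About</a>
"""

def is_destination_link(tag):
    """Return True for unstyled <a> tags that point at a location page."""
    return (tag.name == "a"
            # get() with a default avoids an AttributeError on missing attributes
            and tag.get("href", "").startswith("locations")
            and tag.get("title", "").startswith("Travel ideas")
            and not tag.has_attr("style"))

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() accepts a function and keeps the tags for which it returns True.
found = [a.get_text() for a in soup.find_all(is_destination_link)]
print(found)
```

Passing a default to get() matters here: an anchor with no href would otherwise yield None, and calling startswith on None raises an error.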

More on why your original approach gave you trouble:

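On that actual page, when you use

dest = soup.find('div', attrs={'class':'news-a'})

the tag it returns contains only one h2 element of the kind you are looking for. You need to find_all the <h2> tags in soup instead.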

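To see this, try print(dest.prettify()). You will notice that this nested structure does not contain the city results you want. Note that find only returns the first match. If you are familiar with the concept of the HTML document tree, the other divs (class "news-a") are siblings of your result, not children.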
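
The difference between find and find_all can be reproduced on a small document. This is a sketch, assuming the live page repeats the "news-a" div once per destination as described above; SAMPLE_HTML and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling divs, each holding one destination.
SAMPLE_HTML = """
<div class="news-a"><h2><a>destination 1</a></h2></div>
<div class="news-a"><h2><a>destination 2</a></h2></div>
<div class="news-a"><h2><a>destination 3</a></h2></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first matching div, so only its own <h2> is reachable.
first_div = soup.find("div", class_="news-a")
only_first = [h2.get_text(strip=True) for h2 in first_div.find_all("h2")]

# find_all() visits every matching div, so every <h2> is collected.
all_names = [h2.get_text(strip=True)
             for div in soup.find_all("div", class_="news-a")
             for h2 in div.find_all("h2")]

# The remaining divs are siblings of the first one, not its children.
siblings = first_div.find_next_siblings("div", class_="news-a")

print(only_first)
print(all_names)
print(len(siblings))
```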

BeautifulSoup only returns one result
