How do I count the tags of a specific class that come before a specific element using Beautiful Soup?

Date: 2020-10-20 17:14:42

Tags: python beautifulsoup

I want to count all <a> tags that have the class md-headline and appear before the link whose title contains "Dupont Lewis".

To find the position of the link ("Dupont Lewis") in the page, I use the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sortlist.fr/pub'
response= requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())

soup.a = soup.find_all("a", {"class": "md-headline"})
search = soup.select_one('a[title*="Dupont Lewis"]')
if search:
    position = find_all_previous('a[title*="Dupont Lewis"]')
    print(position.count)
else:
    print('None')

But for some reason, I keep getting 0.

1 Answer:

Answer 0 (score: 1)

Find all previous elements

link = soup.select_one('a[title*="Dupont Lewis"]')
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})

Find all next elements

link = soup.select_one('a[title*="Dupont Lewis"]')
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})
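
The two calls above can be tried offline on a small document. The following is only a minimal sketch: the HTML string and the agency titles "Agency A", "Agency B" and "Agency C" are made up for illustration and are not taken from the real page. It also shows that find_all_previous returns matches starting with the one nearest to the link, that is, in reverse document order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption for this sketch).
html = """
<div>
  <a class="md-headline" title="Agency A">Agency A</a>
  <a class="other" title="Not a headline">Skip me</a>
  <a class="s-bold md-headline" title="Agency B">Agency B</a>
  <a class="md-headline" title="Dupont Lewis">Dupont Lewis</a>
  <a class="md-headline" title="Agency C">Agency C</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one('a[title*="Dupont Lewis"]')

# Anchors with the md-headline class that occur before / after the link.
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})

previous_titles = [a["title"] for a in previous_md_headlines]
next_titles = [a["title"] for a in next_md_headlines]

print(len(previous_md_headlines), previous_titles)  # 2 ['Agency B', 'Agency A']
print(len(next_md_headlines), next_titles)          # 1 ['Agency C']
```

The anchor with class "other" is skipped, and the link itself is not included in either result.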

Original question: why do I always get 0 links with the md-headline class before the link titled "Dupont Lewis"?

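On the page "https://www.sortlist.fr/pub", the first anchor element with the class md-headline happens to be the very anchor whose title is "Dupont Lewis". That is why the count of previous elements is always zero (unless the page changes).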
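
This zero-count situation can be reproduced without the network. The sketch below is illustrative only: count_previous_headlines is a hypothetical helper name, and both HTML strings are invented, reusing two titles from the output further down. It returns 0 when the target link is the first md-headline anchor, a positive count when it comes later, and None when no link with that title exists.

```python
from bs4 import BeautifulSoup


def count_previous_headlines(html, title):
    """Count md-headline anchors before the link whose title contains `title`.

    Hypothetical helper for illustration; returns None if no such link exists.
    """
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(f'a[title*="{title}"]')
    if link is None:
        return None
    return len(link.find_all_previous("a", class_="md-headline"))


# Made-up documents (assumptions): the target link comes first in one, second in the other.
target_first = """
<a class="md-headline" title="Dupont Lewis">Dupont Lewis</a>
<a class="md-headline" title="The Collective Story">The Collective Story</a>
"""
target_second = """
<a class="md-headline" title="The Collective Story">The Collective Story</a>
<a class="md-headline" title="Dupont Lewis">Dupont Lewis</a>
"""

print(count_previous_headlines(target_first, "Dupont Lewis"))   # 0
print(count_previous_headlines(target_second, "Dupont Lewis"))  # 1
print(count_previous_headlines(target_second, "Missing"))       # None
```

Checking for None before calling find_all_previous avoids an AttributeError if the page changes and the link disappears.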

Full example

import requests
from bs4 import BeautifulSoup

url = 'https://www.sortlist.fr/pub'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

link = soup.select_one('a[title*="Dupont Lewis"]')
print(f"link: {link}")
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})

print(f"\n\nFound {len(previous_md_headlines)} previous md-headlines.")
print("Previous md-headline links:\n")
print(*previous_md_headlines, sep="\n\n")

print(f"Found {len(next_md_headlines)} next md-headlines.")
print("Next md-headline links:\n")
print(*next_md_headlines, sep="\n\n")

Output

link: <a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9kdXBvbnQtbGV3aXM=" target="_blank" title="Dupont Lewis">Dupont Lewis</a>


Found 0 previous md-headlines.
Previous md-headline links:

Found 49 next md-headlines.
Next md-headline links:

<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9jb25jZXB0b3J5LTVmMjliMzFhLWExY2YtNDRlYS1iYzA4LWJiMzg2MTkyMmM1OQ==" target="_blank" title="The Collective Story">The Collective Story</a>

<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS90aGUtY3Jldw==" target="_blank" title="The Crew Communication">The Crew Communication</a>

<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9ub3ZlbWJyZQ==" target="_blank" title="Novembre - Creative Business Partner">Novembre - Creative Business Partner</a>
...