Question

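I'm trying to use Beautiful Soup to extract job titles. The title attribute of the span tag is the same as its text. For example, the text is "Barista", and the title is "Barista" as well. So far I've been using .findall, but I don't understand how it works.

Example HTML: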

<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
 <div class="new topLeft holisticNewBlue desktop">
   <span class="label">new</span>
 </div>
 <span title="Barista">Barista</span>
</h2>

Answer 1

Try something like this.

# Imports.
from bs4 import BeautifulSoup

# HTML code.
html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
                <div class="new topLeft holisticNewBlue desktop">
                  <span class="label">new</span>
                </div>
                <span title="Barista">Barista</span>
              </h2>'''

# Parsing HTML.
soup = BeautifulSoup(html_str, 'lxml')
# Searching for `span` tags with `title` attributes.
list_html_titles = soup.find_all('span', attrs={'title': True})
# Getting titles from HTML code blocks.
list_titles = [x.text for x in list_html_titles]
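
As a possible variation on the snippet above, here is a minimal sketch that reads the title attribute itself rather than the tag text. It assumes the built-in 'html.parser' (so lxml does not need to be installed); the variable names titles_from_attr and titles_from_text are made up for illustration.

```python
from bs4 import BeautifulSoup

html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
  <div class="new topLeft holisticNewBlue desktop">
    <span class="label">new</span>
  </div>
  <span title="Barista">Barista</span>
</h2>'''

# Assumption: the standard-library parser is good enough for this fragment.
soup = BeautifulSoup(html_str, 'html.parser')

# attrs={'title': True} matches any <span> that has a title attribute,
# whatever its value, so the "new" label span is skipped.
spans_with_title = soup.find_all('span', attrs={'title': True})

# Read the attribute value directly...
titles_from_attr = [span['title'] for span in spans_with_title]
# ...or the visible text, stripped of surrounding whitespace.
titles_from_text = [span.get_text(strip=True) for span in spans_with_title]

print(titles_from_attr)
print(titles_from_text)
```

In this sample both lists hold the same single value, since the attribute and the text are identical; on a real page the attribute is often the safer choice if the visible text may be truncated.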

Answer 2

You can use Beautiful Soup's recursive argument to search only the direct children of the h2.

I tested the following code sample and it works:

from bs4 import BeautifulSoup
html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
                <div class="new topLeft holisticNewBlue desktop">
                  <span class="label">new</span>
                </div>
                <span title="Barista">Barista</span>
              </h2>'''
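# Parse the HTML.
soup = BeautifulSoup(html_str, 'lxml')
# recursive=False limits the search to direct children of <h2>,
# so the span nested inside the <div> is not considered.
title = soup.h2.find('span', recursive=False).text
print(title)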
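
The same "direct child of h2" idea can probably also be expressed with a CSS selector. The following is only a sketch: it assumes a Beautiful Soup version that supports select_one with the child combinator (4.7 or later, via soupsieve) and uses the built-in 'html.parser'; the selector string is my own choice, not something from the original page.

```python
from bs4 import BeautifulSoup

html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
  <div class="new topLeft holisticNewBlue desktop">
    <span class="label">new</span>
  </div>
  <span title="Barista">Barista</span>
</h2>'''

soup = BeautifulSoup(html_str, 'html.parser')

# "h2.jobTitle > span[title]" means: a <span> that has a title attribute
# and is a direct child of an <h2> carrying the jobTitle class.
title_span = soup.select_one('h2.jobTitle > span[title]')

# select_one returns None when nothing matches, so guard before indexing.
title = title_span['title'] if title_span is not None else None
print(title)
```

To collect every job title on a page with many such headings, soup.select with the same selector returns a list of matching tags instead of a single one.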
