Question

I have an HTML structure like this:

<p class="title">
  <a href="abc.com">
   Story
  </a> 
  <span class="domain">
    <a href="xyz.com">comments</a>
  </span>
</p>

I want to extract the text of the first anchor tag, i.e. Story.

Here is how I am using BeautifulSoup to extract the text from the anchor tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
soup.prettify()
for link in soup.find_all(class_='title'):
    print(link.findNext('a').text)

And the output is:

Story

Comments

But I only want the text of the first anchor tag, i.e. Story. How can I do this with BeautifulSoup in Python?

Answer 1

You can access the first a tag by doing:

print(link.a.text)

To strip the extra whitespace:

link.a.text.strip()
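
As a minimal sketch of the above, assuming the HTML from the question is held in a string named html (the first_texts list and the None check are additions for illustration, not part of the answer):

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<p class="title">
  <a href="abc.com">
   Story
  </a>
  <span class="domain">
    <a href="xyz.com">comments</a>
  </span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

first_texts = []
for link in soup.find_all(class_="title"):
    # link.a is the first <a> found inside this element, or None if there is none.
    first_anchor = link.a
    if first_anchor is not None:
        first_texts.append(first_anchor.text.strip())

print(first_texts)
```

Each element with class title contributes at most one entry, so the nested comments link inside the span is never reached.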

Answer 2

You can do this by chaining find() calls and using the get_text() method:

soup.find("p", class_="title").a.get_text(strip=True)

where .a is equivalent to .find("a") in BeautifulSoup.

Alternatively, use a CSS selector:

soup.select_one("p.title > a").get_text(strip=True)
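
Both one-liners raise AttributeError when nothing matches, because find() and select_one() return None in that case. A hedged sketch of a None-safe variant follows; the helper name first_anchor_text is hypothetical and the markup is the question's, condensed to one line:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><a href="abc.com"> Story </a>'
    '<span class="domain"><a href="xyz.com">comments</a></span></p>'
)
soup = BeautifulSoup(html, "html.parser")


def first_anchor_text(doc):
    """Return the text of the first <a> that is a direct child of p.title, or None."""
    tag = doc.select_one("p.title > a")
    return tag.get_text(strip=True) if tag is not None else None


# The chained find() form and the CSS selector form agree on this markup.
via_find = soup.find("p", class_="title").a.get_text(strip=True)
via_css = first_anchor_text(soup)

# With no matching element, the helper returns None instead of raising.
missing = first_anchor_text(BeautifulSoup("<p>no links</p>", "html.parser"))

print(via_find, via_css, missing)
```

Note that the child combinator in "p.title > a" only matches anchors directly under the p, while .a searches all descendants; for the markup in the question both pick the same tag.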

Answer 3

If you only want the text of the first anchor, you don't need to find by the class.

You aren't doing anything with class="title".

In [9]: html = """
<p class="title">
  <a href="abc.com">
   Story
  </a>
  <span class="domain">
    <a href="xyz.com">comments</a>
  </span>
</p>
"""
In [10]: soup = BeautifulSoup(html, "html.parser")
In [11]: soup.a.text.strip()
Out[11]: u'Story'
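
One caveat worth sketching: soup.a returns the first anchor anywhere in the document, so it only matches the title link when no other anchor comes earlier. The Home link below is a hypothetical addition to the question's markup, used purely to illustrate the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical page where another anchor precedes the title paragraph.
html = """
<a href="home.com">Home</a>
<p class="title">
  <a href="abc.com">Story</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# First anchor in the whole document.
whole_doc_first = soup.a.text.strip()

# First anchor inside the paragraph with class "title".
scoped_first = soup.find("p", class_="title").a.text.strip()

print(whole_doc_first, scoped_first)
```

If the page is known to contain only the snippet from the question, the unscoped soup.a form is enough.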

Python: How to find the text of the first anchor tag using BeautifulSoup
