Extracting the text between link tags in Python using BeautifulSoup

Time: 2011-06-06 11:26:01

Tags: python text tags extract beautifulsoup

I have HTML code like this:

<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>

<h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>

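I need to extract the text between the 'a' tags (the link descriptions). I need an array to store them, like this:

a[0] = "My HomePage"

a[1] = "Sections"

I need to do this in Python using BeautifulSoup.

Please help me, thank you!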

3 answers:

Answer 0 (score: 1)

You can do it like this:

import BeautifulSoup  # BeautifulSoup 3 on Python 2; the newer package is bs4

html = """
<html><head></head>
<body>
<h2 class='title'><a href='http://www.gurletins.com'>My HomePage</a></h2>
<h2 class='title'><a href='http://www.gurletins.com/sections'>Sections</a></h2>
</body>
</html>
"""

soup = BeautifulSoup.BeautifulSoup(html)

print [elm.a.text for elm in soup.findAll('h2', {'class': 'title'})]
# Output: [u'My HomePage', u'Sections']
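The snippet above targets BeautifulSoup 3 on Python 2 (note the `print` statement and the `u''` prefixes). For readers on Python 3 with the `bs4` package, a rough equivalent might look like the sketch below. The variable name `titles` and the choice of the built-in `html.parser` are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<html><head></head>
<body>
<h2 class='title'><a href='http://www.gurletins.com'>My HomePage</a></h2>
<h2 class='title'><a href='http://www.gurletins.com/sections'>Sections</a></h2>
</body>
</html>
"""

# Parse with the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# For each <h2 class="title">, take the text of its first <a> child.
titles = [h2.a.get_text() for h2 in soup.find_all("h2", class_="title")]

print(titles)  # ['My HomePage', 'Sections']
```

Here `titles[0]` and `titles[1]` correspond to the `a[0]` and `a[1]` asked for in the question.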

Answer 1 (score: 0)

print [a.findAll(text=True) for a in soup.findAll('a')]
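One thing worth noting about this one-liner: asking each link for all of its text nodes gives one list per link, so the result is a list of lists rather than the flat array the question asks for. The sketch below, written for Python 3 and `bs4` as an assumption, shows both shapes side by side; the names `nested` and `flat` are made up for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>\n'
    '<h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# One list of text nodes per <a> tag, converted to plain str for readability.
nested = [[str(s) for s in a.find_all(string=True)] for a in soup.find_all("a")]

# One string per <a> tag, which matches the a[0], a[1] layout in the question.
flat = [a.get_text() for a in soup.find_all("a")]

print(nested)  # [['My HomePage'], ['Sections']]
print(flat)    # ['My HomePage', 'Sections']
```

If a link contained nested markup such as `<b>`, the nested form would hold several strings for that link, while `get_text()` would join them into one.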

Answer 2 (score: 0)

The following code extracts the text between the 'a' tags (the link descriptions) and stores it in an array.

>>> from bs4 import BeautifulSoup
>>> data = """<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>
...
... <h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqTxt = soup.find_all("h2", {"class":"title"})
>>> a = []
>>> for i in reqTxt:
...     a.append(i.get_text())
...
>>> a
['My HomePage', 'Sections']
>>> a[0]
'My HomePage'
>>> a[1]
'Sections'