Extracting the text between link tags with BeautifulSoup in Python

Time: 2015-05-29 23:47:35

Tags: python html web-scraping beautifulsoup

My HTML looks like this:

<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>

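I am trying to extract the text that is displayed when this HTML is rendered.

More specifically, for this example 'a' tag, I am trying to extract "EZSTORAGE - PACK IT. STORE IT. WIN - Nationwide - Restrictions - Ends 6/30/15".

But I am having trouble extracting the full text because it is broken up by the 'img' and 'span' tags.

To give more context, I have been using the following code to find all 'a' tags and extract the link text.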

for link in soup.find_all('a', id='mylink'):
    raw.append(link)
    link_text = link.contents[0].encode('utf-8')
    sweeps.append(link_text)

#output: 'EZSTORAGE - PACK IT. STORE IT. WIN - '

Any insight is much appreciated!

2 answers:

Answer 0: (score: 0)

Use link.text instead of link.contents. Does this MWE do what you want?
from bs4 import BeautifulSoup

text = """
<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')

for link in soup.find_all('a', id='mylink'):
    # .text concatenates every text node inside the tag, nested tags included
    link_text = link.text
    print(link_text)

Result:

EZSTORAGE - PACK IT. STORE IT. WIN -  Nationwide - Restrictions - Ends 6/30/15
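The doubled space after "WIN -" in that result comes from the text nodes on either side of the img tag, each of which carries its own space. As a minimal sketch, assuming bs4 is installed and Python's built-in html.parser is acceptable, get_text() with a separator and strip=True returns the single-spaced string the question asks for. The variable names here are illustrative only.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - '
    '<img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/>'
    ' Nationwide - '
    '<span title="college students/staff of schools in valid states">Restrictions</span>'
    ' - Ends 6/30/15</a>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", id="mylink")

# contents[0] is only the first child node: the text that precedes <img>
first_piece = link.contents[0]

# get_text() visits every descendant string; strip=True trims each piece
# and the separator rejoins the pieces with a single space
clean_text = link.get_text(" ", strip=True)
print(clean_text)
```

This also shows why the loop in the question stopped early: link.contents is a list of child nodes, and index 0 is just the first text node.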

Answer 1: (score: 0)

You can use a regular expression to find all the text

import re

content = r'<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>'

# Capture whatever sits between a '>' and the next '<'
links = re.findall(r'>(.*?)<', content)

# Join the captured pieces back into one string
a = "".join(links)
print(a)

This returns "EZSTORAGE - PACK IT. STORE IT. WIN -  Nationwide - Restrictions - Ends 6/30/15"
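The pattern above works for this snippet, but it relies on no attribute value containing '>' or '<'. If adding bs4 is not an option, one alternative is the standard-library HTML parser. The following is only a sketch, assuming Python 3; TextCollector is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers every text node it is fed."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes never reach here
        self.pieces.append(data)


content = (
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - '
    '<img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/>'
    ' Nationwide - '
    '<span title="college students/staff of schools in valid states">Restrictions</span>'
    ' - Ends 6/30/15</a>'
)

collector = TextCollector()
collector.feed(content)
collector.close()  # flush any text still buffered by the parser

# Join the pieces, then collapse runs of whitespace into single spaces
joined = " ".join("".join(collector.pieces).split())
print(joined)
```

Because the parser tokenizes tags itself, attribute contents are never mistaken for link text.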