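# Imports assumed: Python 2's urllib2 and the bs4 package (the original post omits them).
import urllib2
from bs4 import BeautifulSoup

def get_description(link):
    # Download the page and read the raw HTML.
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    # .text returns only the text inside the div, with all tags stripped.
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc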
This is the code that gives me the text from this HTML:
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a> </p><p> </p>
</div>
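This result works well for me. I just want to exclude the content of
<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>
from the result, i.e. from desc in my script: drop it if it is present, and leave the text unchanged if it is not. How can I do this?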
Answer 0: (score: 1)
Just make a small change to the last line and import the re module:
...
    return re.sub(r'<a(.*)</a>', '', desc)
Output:
'<div class="op_gd14 FL">\n <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> \n </p><p>
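One thing to watch with the regex approach: a pattern that matches <a ...>...</a> only finds something when it runs on the markup itself, because .text has already stripped every tag and leaves the link text behind as plain words. The following is a minimal, standard-library-only sketch of that idea, not the answerer's exact code. The names strip_anchors, text_without_links, and sample are hypothetical, the pattern is made non-greedy so several links on one line are removed separately, and the tag-stripping step is a crude stand-in for a real HTML parser that is only meant for simple fragments like the one in the question.

```python
import re

# Non-greedy match of one whole <a ...>...</a> element; DOTALL lets it span lines.
ANCHOR_RE = re.compile(r'<a\b[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)
# Crude matcher for any remaining tag; adequate for simple fragments only.
TAG_RE = re.compile(r'<[^>]+>')


def strip_anchors(markup):
    """Return markup with every <a> element (tag and link text) removed."""
    return ANCHOR_RE.sub('', markup)


def text_without_links(markup):
    """Drop <a> elements, then drop the remaining tags and tidy whitespace."""
    no_links = strip_anchors(markup)
    text = TAG_RE.sub('', no_links)
    return ' '.join(text.split())


# The HTML fragment from the question, used as sample input.
sample = (
    '<div class="op_gd14 FL">\n'
    '<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE '
    'that the 18th Annual General Meeting (AGM) of the Company will be held on '
    'September 30, 2015.Source : BSE<br><br>\n'
    '<a href="../../company-notices/nestleindia/notices/PEP02">'
    'Read all announcements in Prestige Estate</a> </p><p> </p>\n'
    '</div>'
)

print(text_without_links(sample))
```

If no link is present, the substitution matches nothing and the text comes back unchanged, which covers the "ignore it if it does not exist" case. If you would rather stay inside BeautifulSoup, an alternative is to look up the <a> tag inside the div, call decompose() on it when it is found, and only then read .text.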