Question

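# Python 2 code; the bs4 import is an assumption, the original snippet omits its imports.
import urllib2
from bs4 import BeautifulSoup

def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    # Text of the first div whose class attribute is exactly "op_gd14 FL"
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc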

This is the code that gives me the text from this HTML:

    <div class="op_gd14 FL">
    <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
    <a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>  </p><p>                                                </p>
    </div>

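This result is fine for me. I just want to exclude the content of

<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>

from the result, i.e. from desc in my script, when the link is present, and leave desc unchanged when it is not. How can I do that?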

Answer 1

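Just make a small change to the last line and import the re module. For the pattern to match, desc has to hold the div's markup (for example str(soup.find(...)) without .text), because .text no longer contains any tags: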

...
return re.sub(r'<a(.*)</a>','',desc)

Output:

'<div class="op_gd14 FL">\n    <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>  \n  </p><p>
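As an alternative to the regular expression, the link can be removed from the parsed tree before the text is extracted, which returns plain text rather than markup. The following is only a minimal sketch: it assumes the bs4 package with the built-in html.parser, the sample markup is copied from the question with whitespace trimmed, and extract_description and SAMPLE_HTML are hypothetical names used for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (whitespace trimmed).
SAMPLE_HTML = """
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>  </p><p> </p>
</div>
"""


def extract_description(html):
    """Hypothetical helper: return the div's text with any <a> elements removed."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", attrs={"class": "op_gd14 FL"})
    if div is None:
        # The page has no such div, so there is nothing to extract.
        return None
    # find_all returns an empty list when the div has no links,
    # so the loop simply does nothing in that case.
    for link in div.find_all("a"):
        link.decompose()  # remove the tag and its contents from the tree
    return div.get_text().strip()


if __name__ == "__main__":
    print(extract_description(SAMPLE_HTML))
```

In get_description from the question, the same idea would mean keeping the result of soup.find(...) in a variable, calling decompose() on each link inside it, and only then reading .text. Because the loop runs zero times when no link exists, no separate check for the link is needed.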

Remove specific content from the result using BeautifulSoup
