Can't strip paragraph tags

Asked: 2014-07-14 16:27:34

Tags: python web-scraping beautifulsoup

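I wrote this code to scrape news content from the BBC. It works so far, but the output still contains the paragraph tags. I have tried removing the HTML tags with a regular expression, but that did not work either. I need help.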

Thanks

import feedparser
from bs4 import BeautifulSoup
import urllib2
from urllib2 import urlopen
import re
import cookielib
from cookielib import CookieJar 
import time
import os

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders= [('User-agent','Mozilla')]

bbcRSSFeed = feedparser.parse('http://feeds.bbci.co.uk/news/rss.xml')

numberstories=[len(bbcRSSFeed)]
FeedLinks=[]
FeedTitles=[]

for post in bbcRSSFeed.entries:
    FeedLinks.append(post.link)
    FeedTitles.append(post.title)

limit=2
counter=0
paraStringList = []

for i in FeedLinks:
    #if counter<len(FeedLinks): #displays the content of every link
    if counter<limit:
        print "["+i +"]"
        newpage = urlopen(i)
        soup = BeautifulSoup(newpage)
        text = soup.select('.story-body p') #content of the news story
        print (text)
        counter+=1

3 Answers:

Answer 0 (score: 2)

If you only want the text of the selected elements, use the element.get_text() method:

text = '\n\n'.join([para.get_text(' ', strip=True) for para in soup.select('.story-body p')])
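As a minimal sketch of what that line does, here is the same call run against a small inline snippet instead of a live BBC page. The markup below is a made-up stand-in (an assumption, not the real page structure), and it uses the bs4 package with the built-in "html.parser" so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a BBC story page (assumption).
html = """
<div class="story-body">
  <p>First paragraph.</p>
  <p>  Second paragraph.  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of <p> Tag objects; printing that list
# directly is what shows the tags in the question's output.
paragraphs = soup.select(".story-body p")

# get_text(' ', strip=True) strips each text fragment and joins the
# fragments with a single space, so no markup is left.
text = "\n\n".join(p.get_text(" ", strip=True) for p in paragraphs)
print(text)
```

The key point is that the question prints the list of Tag objects, while this joins the text extracted from each one.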

Answer 1 (score: 1)

text = "\n".join([s.text for s in soup.select('.story-body p')])
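For comparison, a small sketch of how .text differs from get_text(' ', strip=True) when a paragraph has surrounding whitespace and nested tags. The markup is invented for illustration; which form is preferable depends on the page being scraped.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with padding and a nested <b> tag (assumption).
html = '<div class="story-body"><p>  First <b>paragraph</b>.  </p></div>'
soup = BeautifulSoup(html, "html.parser")
para = soup.select(".story-body p")[0]

# .text concatenates the text fragments exactly as they appear,
# so leading and trailing whitespace is kept.
plain = para.text

# get_text(' ', strip=True) strips every fragment and puts a space
# between them, which also inserts a space before the final period.
joined = para.get_text(" ", strip=True)

print(repr(plain))
print(repr(joined))
```

If the extra space around inline tags matters, calling plain.strip() on the .text result is one simple alternative.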

Answer 2 (score: 0)

for para in text:  # text is the list returned by soup.select(...)
    for x in para.contents:
        print( x )

This gives you everything inside the <p> tag.
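One caveat, sketched below on an invented snippet: .contents returns the direct children of the tag, so any nested element (a link, for example) still comes back as a Tag and prints with its markup. Iterating over .strings instead yields only the text pieces. This assumes the bs4 package rather than BeautifulSoup 3.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph containing a nested link (assumption).
html = '<p>Read <a href="/news">more</a> here.</p>'
soup = BeautifulSoup(html, "html.parser")
para = soup.p

# Direct children: two text nodes and one <a> Tag, markup included.
children = [str(x) for x in para.contents]
print(children)

# .strings walks all descendants and yields text only.
pieces = list(para.strings)
print("".join(pieces))
```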

BeautifulSoup 3.2.1