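I wrote this code to scrape news content from the BBC. So far it works, but the output still shows the paragraph tags. I have tried removing the HTML tags with a regular expression, but that did not work either. I need help.
Thanks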
import feedparser
from bs4 import BeautifulSoup
import urllib2
from urllib2 import urlopen
import re
import cookielib
from cookielib import CookieJar
import time
import os
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders= [('User-agent','Mozilla')]
bbcRSSFeed = feedparser.parse('http://feeds.bbci.co.uk/news/rss.xml')
numberstories=[len(bbcRSSFeed)]
FeedLinks=[]
FeedTitles=[]
for post in bbcRSSFeed.entries:
    FeedLinks.append(post.link)
    FeedTitles.append(post.title)
limit=2
counter=0
paraStringList = []
for i in FeedLinks:
    #if counter<FeedLinks: #displays the content of every link
    if counter<limit:
        print "["+i +"]"
        newpage = urlopen(i)
        soup = BeautifulSoup(newpage)
        text = soup.select('.story-body p') #content of the news story
        print (text)
        counter+=1
Answer 0 (score: 2)
If you only want the text inside the selected elements, use the element.get_text() method:
text = '\n\n'.join([para.get_text(' ', strip=True) for para in soup.select('.story-body p')])
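To see the effect without fetching a live page, here is a minimal sketch that applies the same get_text() call to a small inline HTML string. The SAMPLE_HTML content and the extract_story_text helper are made up for illustration, and the sketch assumes Python 3 with bs4 installed and the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded article page (not real BBC markup).
SAMPLE_HTML = """
<div class="story-body">
  <p>First <b>paragraph</b> of the story.</p>
  <p>  Second paragraph.  </p>
</div>
<p>Outside the story body.</p>
"""


def extract_story_text(html):
    """Return the text of every <p> under .story-body with the tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.select(".story-body p")
    # get_text(" ", strip=True) strips each text fragment and joins them with a space.
    return "\n\n".join(p.get_text(" ", strip=True) for p in paragraphs)


story_text = extract_story_text(SAMPLE_HTML)
print(story_text)
```

In the question's loop, the same join expression would replace the direct print of the soup.select() result, which prints Tag objects and therefore shows the markup.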
Answer 1 (score: 1)
text = "\n".join([s.text for s in soup.select('.story-body p')])
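For comparison, the .text attribute is a shortcut for get_text() with no arguments, so text fragments separated only by inline tags are joined with nothing between them. The sketch below uses a made-up one-paragraph snippet and again assumes Python 3 with bs4 and the "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph whose two lines are separated only by a <br/> tag.
soup = BeautifulSoup("<p>Line one<br/>line two</p>", "html.parser")
para = soup.select("p")[0]

plain = para.text                         # same as para.get_text()
spaced = para.get_text(" ", strip=True)   # fragments stripped, joined with a space

print(repr(plain))
print(repr(spaced))
```

Either form removes the tags; the separator argument only matters when the paragraphs contain nested markup such as line breaks or links.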
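Answer 2 (score: 0)
# assumes text is a single <p> Tag, not the list returned by soup.select()
for x in text.contents:
    print( x )
This prints all the content inside the <p> tag.
BeautifulSoup 3.2.1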