How to parse HTML markup in an RSS feed in Python

Date: 2014-06-08 12:05:39

Tags: python html parsing rss

I have a small utility that produces a plain-text readout of an RSS feed. Here is representative code:

#!/usr/bin/python

# /usr/lib/xscreensaver/phosphor -scale 3 -program 'python newsfeed.py | tee /dev/stderr | festival --tts'

import sys
import os
import feedparser

def printLine():
    # 'stty size' prints "rows columns" for the current terminal
    terminalRows, terminalColumns = os.popen('stty size', 'r').read().split()
    # draw a rule across the full terminal width, followed by a blank line
    sys.stdout.write("-" * int(terminalColumns))
    print("\n")

feed = feedparser.parse('http://home.web.cern.ch/scientists/updates/feed')

for post in feed.entries:
    printLine()
    print(post.title + "\n")
    print(post.description + "\n")
printLine()

When run, the output looks like this:

-----------------------------------------------------------------------------------------------------

LHC seminar: Higgs boson width

<div class="field-body">
    <p>Constraints on the total Higgs boson width, Gamma_H, are presented using off-shell production and decay to ZZ in the 4l and 2l2nu final states. The analysis is based on data collected in 2012 by the CMS experiment at the LHC, corresponding to an integrated luminosity of L = 19.7/fb at a centre-of-mass energy of 8 TeV. The combined analysis of the 4l and 2l2nu events at high mass with the 4l measurement of the Higgs boson peak at 125.6 GeV leads to an upper limit on the Higgs boson width of Gamma_H &lt; 4.2 x Gamma_H(SM) at the 95% confidence level, assuming Gamma_H(SM) = 4.15 MeV. This result considerably improves over previous experimental constraints from direct measurements at the Higgs resonance peak.</p>
<h2><a href="https://indico.cern.ch/event/313506/">Watch the webcast at 11am CET</a></h2>
  </div>

-----------------------------------------------------------------------------------------------------

Neutrinos and nucleons

<p class="field-byline-taxonomy">
<a href="http://home.web.cern.ch/authors/christine-sutton">Christine Sutton</a></p>
  <div class="field-body">
    <p>On 7 April 1934 the journal <em>Nature</em> published a paper in which Hans Bethe and Rudolf Peierls made a first calculation of the neutrino cross-section and concluded that "it seems highly improbable that, even for cosmic ray energies, the cross-section becomes large enough to allow the process to be observed". Forty years on, neutrino cross-sections were not only being measured with the <a href="http://home.web.cern.ch/about/experiments/gargamelle">Gargamelle</a> bubble chamber at CERN's <a href="http://home.web.cern.ch/about/accelerators/proton-synchrotron">Proton Synchrotron</a>, they were helping to reveal a more fundamental layer to nature - the quarks.</p>
<p><strong>Read more:</strong> "<a href="http://cerncourier.com/cws/article/cern/56605">Neutrinos and nucleons</a>"- <em>CERN Courier</em></p>
  </div>

-----------------------------------------------------------------------------------------------------

What would be a sensible way to convert this to plain text, without the HTML markup, that works for most RSS feeds?

1 answer:

Answer 0 (score: 1)

You could try the Python module beautifulsoup4 (available via pip). This question may guide you on how to use it for your purpose.

To get started:

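from bs4 import BeautifulSoup
# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(post.description, "html.parser")
# collect every text node, ignoring the tags around them
texts = soup.find_all(string=True)
print(''.join(texts))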

This displays:

Christine Sutton

On 7 April 1934 the journal Nature published a paper in which Hans Bethe and Rudolf Peierls made a first calculation of the neutrino cross-section and concluded that "it seems highly improbable that, even for cosmic ray energies, the cross-section becomes large enough to allow the process to be observed". Forty years on, neutrino cross-sections were not only being measured with the Gargamelle bubble chamber at CERN's Proton Synchrotron, they were helping to reveal a more fundamental layer to nature - the quarks.
Read more: "Neutrinos and nucleons"- CERN Courier
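
If installing a third-party package is not an option, the same idea (keep the text nodes, drop the tags) can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only a rough sketch under a few assumptions: the names TextExtractor, html_to_text, and BLOCK_TAGS are made up for illustration, and the set of tags treated as block-level is a guess rather than a complete list.

```python
from html.parser import HTMLParser
import re

# Tags treated as block-level in this sketch (an assumption, not a full list).
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; into "<"
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Start block elements on a new line.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # End block elements with a line break as well.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return markup as plain text, one line per non-empty block."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each line, then drop lines that are empty.
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

In the loop from the question, this would be used as print(html_to_text(post.description)). Unlike joining the raw text nodes, it also strips the indentation and blank lines left over from the feed's markup, which may or may not be what you want for a text-to-speech pipeline.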