Download and include referenced URLs in XML

Time: 2016-04-30 09:43:05

Tags: python xml rss

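I have an RSS feed from a news source. Alongside the news text and other metadata, the feed also contains a URL reference to the comments section, which is also available in RSS format. I want to download and include the contents of the comments section for each news article. My goal is to create an RSS feed that contains both the article and its comments for every article in the original feed, and then convert this new RSS to PDF.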

Here is a sample of the XML:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
        <author>
            <name>Some Author</name>
            <uri>http://thenews.com</uri>
        </author>
        <category term="sports" label="Sports" />
        <content type="html">This is the news text.</content>
        <id>123abc</id>
        <link href="http://thenews.com/article/123abc/comments" />
        <updated>2016-04-29T13:44:00+00:00</updated>
        <title>The Title</title>
    </entry>
    <entry>
        <author>
            <name>Some other Author</name>
            <uri>http://thenews.com</uri>
        </author>
        <category term="sports" label="Sports" />
        <content type="html">This is another news text.</content>
        <id>123abd</id>
        <link href="http://thenews.com/article/123abd/comments" />
        <updated>2016-04-29T14:46:00+00:00</updated>
        <title>The other Title</title>
    </entry>
</feed>

Now I want to replace <link href="http://thenews.com/article/123abc/comments" /> with the content behind that URL. The RSS feed can be obtained by appending /rss to the URL. So in the end, a single entry would look like this:

<entry>
  <author>
    <name>Some Author</name>
    <uri>http://thenews.com</uri>
  </author>
  <category term="sports" label="Sports" />
  <content type="html">This is the news text.</content>
  <id>123abc</id>
  <comments>
    <comment>    
      <author>A commenter</author>
      <timestamp>2016-04-29T16:00:00+00:00</timestamp>
      <text>Cool story, yo!</text>
    </comment>
    <comment>
      <author>Another commenter</author>
      <timestamp>2016-04-29T16:01:00+00:00</timestamp>
      <text>This is interesting news.</text>
    </comment>
  </comments>
  <updated>2016-04-29T13:44:00+00:00</updated>
  <title>The Title</title>
</entry>

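I am open to any programming language. I tried this with Python and lxml but could not get far. I was able to extract the comments URL and download the comments feed, but I could not replace the actual <link> tag. Leaving out the actual RSS download, here is how far I got: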

import lxml.etree as et
import urllib2
import re

# These will be downloaded from the RSS feed source when the code works
xmltext = """[The above news feed, too long to paste]"""
commentsRSS = """[The above comments feed]"""

hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

article = et.fromstring(xmltext)

for elem in article.xpath('//feed/entry'):
    commentsURL = elem.xpath('link/@href')

    #request  = urllib2.Request(commentsURL[0] + '.rss', headers=hdr)
    #comments = urllib2.urlopen(request).read()
    comments = commentsRSS

    # Now the <link>-tag should be replaced by the comments feed without the <?xml ...> tag

1 answer:

Answer 0 (score: 1)

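For each <link> element, download the XML referenced by its href attribute and parse that XML into a new Element. Then replace the <link> with the corresponding new Element, like this: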

....
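article = et.fromstring(xmltext)
# The feed declares a default Atom namespace, so XPath needs a prefix mapped to it
ns = {'d': 'http://www.w3.org/2005/Atom'}
for elem in article.xpath('//d:feed/d:entry/d:link', namespaces=ns):
    # Download the comments feed referenced by this <link>
    request  = urllib2.Request(elem.attrib['href'] + '.rss', headers=hdr)
    comments = urllib2.urlopen(request).read()
    # Parse the downloaded XML and swap it in for the <link> element
    newElem = et.fromstring(comments)
    elem.getparent().replace(elem, newElem)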

# print the result
print et.tostring(article)
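
The answer's code targets Python 2 (urllib2, the print statement) and needs network access. As a rough illustration of the same replace step, here is a minimal Python 3 sketch that runs offline. It makes several assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, so there is no getparent() and the parent <entry> is located by iterating over entries. The names embed_comments, fetch_comments, and FAKE_COMMENTS are hypothetical helpers, and the canned comments XML stands in for a real download. The shortened feed is adapted from the sample in the question.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the feed's default Atom namespace.
ATOM_NS = {"d": "http://www.w3.org/2005/Atom"}

# Shortened version of the question's sample feed.
FEED_XML = """<feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
        <id>123abc</id>
        <link href="http://thenews.com/article/123abc/comments" />
        <title>The Title</title>
    </entry>
</feed>"""

# Hypothetical stand-in for downloaded comments feeds, keyed by URL.
FAKE_COMMENTS = {
    "http://thenews.com/article/123abc/comments": """<?xml version="1.0" encoding="UTF-8"?>
<comments>
    <comment>
        <author>A commenter</author>
        <timestamp>2016-04-29T16:00:00+00:00</timestamp>
        <text>Cool story, yo!</text>
    </comment>
</comments>""",
}


def fetch_comments(url):
    """Hypothetical fetcher: returns canned bytes instead of using the network."""
    return FAKE_COMMENTS[url].encode("utf-8")


def embed_comments(feed_xml, fetch):
    """Replace every Atom <link> inside an <entry> with the XML returned by fetch(href)."""
    root = ET.fromstring(feed_xml)
    for entry in root.findall("d:entry", ATOM_NS):
        # findall returns a list, so replacing children while looping is safe.
        for link in entry.findall("d:link", ATOM_NS):
            new_elem = ET.fromstring(fetch(link.get("href")))
            new_elem.tail = link.tail  # keep the whitespace that followed <link>
            index = list(entry).index(link)
            entry[index] = new_elem  # swap the new element in at the same position
    return root


result = embed_comments(FEED_XML, fetch_comments)
print(ET.tostring(result, encoding="unicode"))
```

To work against the real site, fetch_comments could be swapped for a function that downloads the URL with urllib.request and returns the response bytes. Passing the fetcher in as an argument keeps the replacement logic testable without a network connection.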