Python - RSS Web Scraping - Selecting the Correct Elements

Time: 2014-01-18 14:56:27

Tags: python xml rss

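I posted an earlier question asking for help formatting the output of data scraped from an RSS feed.

The answer I received was exactly what I needed, and the output is now formatted as required.

The updated code is as follows: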

import urllib2
from urllib2 import urlopen
import re
import cookielib
from cookielib import CookieJar
import time

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent','Mozilla/5.0')]

def main():
    try:
        page = 'http://feeds.link.co.uk/thelink/rss.xml'
        sourceCode = opener.open(page).read()

        try:
            titles = re.findall(r'<title>(.*?)</title>',sourceCode)
            desc = re.findall(r'<description>(.*?)</description>',sourceCode)
            links = re.findall(r'<link>(.*?)</link>',sourceCode)
            pub = re.findall(r'<pubDate>(.*?)</pubDate>',sourceCode)

            for i in range(len(titles)):
                print titles[i]
                print desc[i]
                print links[i]
                print pub[i]
                print ""

        except Exception, e:
            print str(e)

    except Exception, e:
        print str(e)

main() 

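This runs as intended and prints to the console, but when the numbers of matched elements differ I get a "list index out of range" error.

The XML I am extracting data from reuses some of the same elements in its header (the channel and image sections), so the titles, descriptions and links fall out of step with one another, which causes the error.

The XML is as follows: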

<rss>  
  <channel> 
    <title>Title1</title>  #USING THIS WOULD BE OK, BUT **
    <link>http://link.co.uk</link>  
    <description>The descriptor</description>  
    <language>en-gb</language>  
    <lastBuildDate>Sat, 18 Jan 2014 06:32:19 GMT</lastBuildDate>  
    <copyright>Usable</copyright>  
    <image> #**THIS IS THE AREA I WANT TO EXCLUDE!!
      <url>http://link.co.uk.1gif</url>  
      <title>Title2</title> #DONT WANT THIS ELEMENT!! 
      <link>http://link.co.uk/info</link>  
      <width>120</width>  
      <height>60</height> 
    </image>  #**THIS IS THE AREA I WANT TO EXCLUDE!!
    <ttl>15</ttl>  
    <atom:link href="http://thelink" rel="self" type="application/rss+xml"/>  ###
    <item> #I WANT TO START THE SCRAPE FROM HERE!!
      <title>Title3</title>  
      <description>This will be the first decription.</description>  
      <link>http://www.thelink3.co.uk</link>  
      <guid isPermaLink="false">http://www.thelink.co.uk/5790820</guid>  
      <pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>  
    </item>  
    <item> 
      <title>Title4</title>  
      <description>This will be the second description.</description>  
      <link>http://www.thelink3.co.uk/second link</link>  
      <guid isPermaLink="false">http://www.thelink.co.uk/5790635</guid>  
      <pubDate>Sat, 18 Jan 2014 09:56:14 GMT</pubDate>   
    </item>  #I WANT THE SCRAPE TO END HERE
  </channel>
</rss>
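
To see where the lists fall out of step, here is a minimal sketch that runs the same four patterns over a trimmed copy of the XML above and counts the matches. It is written for Python 3, and SAMPLE and count_matches are names made up for this illustration, not part of the original script.

```python
import re

# Trimmed copy of the feed above, with the annotations removed.
SAMPLE = """<rss><channel>
<title>Title1</title>
<link>http://link.co.uk</link>
<description>The descriptor</description>
<image>
<title>Title2</title>
<link>http://link.co.uk/info</link>
</image>
<item>
<title>Title3</title>
<description>This will be the first decription.</description>
<link>http://www.thelink3.co.uk</link>
<pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>
</item>
<item>
<title>Title4</title>
<description>This will be the second description.</description>
<link>http://www.thelink3.co.uk/second link</link>
<pubDate>Sat, 18 Jan 2014 09:56:14 GMT</pubDate>
</item>
</channel></rss>"""


def count_matches(xml_text):
    """Count how often each tag matches across the whole document."""
    counts = {}
    for tag in ("title", "description", "link", "pubDate"):
        pattern = r"<%s>(.*?)</%s>" % (tag, tag)
        counts[tag] = len(re.findall(pattern, xml_text))
    return counts


counts = count_matches(SAMPLE)
print(counts)
```

With these counts the loop over range(len(titles)) runs four times, but pub holds only two entries, so pub[2] raises the IndexError. The first two iterations also pair the channel and image titles with the dates of the items.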

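Is there a way to change the Python code so that it skips the header elements and uses only the elements inside each item below them?

I have checked a few RSS feeds and they are all structured the same way, so I plan to reuse this code, changing only the URL, to scrape several RSS feeds for display on a Raspberry Pi console.

Any help is much appreciated.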

3 answers:

Answer 0: (score: 0)

Have you tried BeautifulSoup4? It makes finding the elements you want much easier.

Use code like this:

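# soup is a BeautifulSoup object built from the downloaded feed source
title = soup.find('title')  # returns the first <title> in the document
if title:
    print title.text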

Also, to avoid the "list index out of range" error, you can first check that the list has enough elements:

if len(titles) <= i:  # titles has no element at index i
    return
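
If you would rather keep the regular expressions, another option is to cut the feed into item blocks first and search inside each block, so the channel and image elements never reach the lists. A minimal sketch, assuming Python 3 and a feed whose items are plain <item>...</item> blocks without attributes or CDATA sections; parse_items, FIELDS and SAMPLE are hypothetical names used only here:

```python
import re

FIELDS = ("title", "description", "link", "pubDate")

# Trimmed copy of the question's feed. The pubDate of the second item is
# left out on purpose, to show the fallback for a missing tag.
SAMPLE = """<rss><channel>
<title>Title1</title>
<link>http://link.co.uk</link>
<image>
<title>Title2</title>
<link>http://link.co.uk/info</link>
</image>
<item>
<title>Title3</title>
<description>This will be the first decription.</description>
<link>http://www.thelink3.co.uk</link>
<pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>
</item>
<item>
<title>Title4</title>
<description>This will be the second description.</description>
<link>http://www.thelink3.co.uk/second link</link>
</item>
</channel></rss>"""


def parse_items(xml_text):
    """Return one dict per <item> block; channel and image tags are ignored."""
    items = []
    # re.DOTALL lets '.' match the newlines inside an <item> block.
    for block in re.findall(r"<item>(.*?)</item>", xml_text, re.DOTALL):
        entry = {}
        for tag in FIELDS:
            match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), block, re.DOTALL)
            # A missing tag becomes an empty string instead of an IndexError.
            entry[tag] = match.group(1).strip() if match else ""
        items.append(entry)
    return items


items = parse_items(SAMPLE)
for entry in items:
    for tag in FIELDS:
        print(entry[tag])
    print("")
```

Because every item carries its own four fields, there is no shared index left to run out of range.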

I hope this helps :)

Answer 1: (score: 0)

You should use a proper XML parser, such as Beautiful Soup, rather than regular expressions.

from bs4 import BeautifulSoup

data = sourceCode # your sourceCode variable from your main() function

soup = BeautifulSoup(data)
for item in soup.find_all('item'):
    for tag in ['title', 'description', 'link', 'pubdate']:
        print(tag.upper(), item.find(tag).text)
    print()

Output:

TITLE Title3
DESCRIPTION This will be the first decription.
LINK 
PUBDATE Sat, 18 Jan 2014 09:53:10 GMT

TITLE Title4
DESCRIPTION This will be the second description.
LINK 
PUBDATE Sat, 18 Jan 2014 09:56:14 GMT
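
The empty LINK lines are most likely a parser effect: BeautifulSoup(data) without a parser argument falls back to an HTML parser, and in HTML <link> is a void element, so the URL text ends up outside the tag. The same HTML parsing lowercases tag names, which is why 'pubdate' matches <pubDate>. A minimal sketch of one way around this, assuming Python 3 and swapping in the standard library's xml.etree.ElementTree, which reads the feed as XML; parse_items, FIELDS and SAMPLE are hypothetical names for this illustration:

```python
import xml.etree.ElementTree as ET

FIELDS = ("title", "description", "link", "pubDate")

# Trimmed copy of the question's feed. The atom:link line is left out because
# this snippet does not declare the atom namespace; a real feed would
# normally declare it on the <rss> element.
SAMPLE = """<rss>
  <channel>
    <title>Title1</title>
    <link>http://link.co.uk</link>
    <image>
      <title>Title2</title>
      <link>http://link.co.uk/info</link>
    </image>
    <item>
      <title>Title3</title>
      <description>This will be the first decription.</description>
      <link>http://www.thelink3.co.uk</link>
      <pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>
    </item>
    <item>
      <title>Title4</title>
      <description>This will be the second description.</description>
      <link>http://www.thelink3.co.uk/second link</link>
      <pubDate>Sat, 18 Jan 2014 09:56:14 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, read with an XML parser."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is absent.
        items.append(
            {tag: item.findtext(tag, default="").strip() for tag in FIELDS}
        )
    return items


items = parse_items(SAMPLE)
for entry in items:
    for tag in FIELDS:
        print(tag.upper(), entry[tag])
    print()
```

With Beautiful Soup itself, passing "xml" as the second argument to BeautifulSoup (this needs lxml installed) should have the same effect; the tag name then has to be written pubDate, since XML is case-sensitive.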

Answer 2: (score: 0)

Well, what can I say?

BeautifulSoup saves me a lot of typing :)

from __future__ import print_function  # print() as a function on Python 2
import urllib2
from bs4 import BeautifulSoup
url = "http://feeds.link.co.uk/thelink/rss.xml"
sourceCode = urllib2.urlopen(url).read()

data = sourceCode 

soup = BeautifulSoup(data)
for item in soup.find_all('item'):
    for tag in ['title', 'description', 'link', 'pubdate']:
        print(tag.upper(), item.find(tag).text)
    print()
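
urllib2 exists only on Python 2. On Python 3 the same download goes through urllib.request; here is a minimal sketch, assuming Python 3, that keeps the User-agent header from the question but leaves out its cookie handling. build_request and fetch are hypothetical helper names, the 10 second timeout is an arbitrary choice, and nothing is downloaded until fetch is called:

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Prepare a request that sends the same User-agent as the question's opener."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a feed and return its raw bytes (timeout value is an assumption)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network.
request = build_request("http://feeds.link.co.uk/thelink/rss.xml")
print(request.full_url)
print(request.get_header("User-agent"))
```

The bytes returned by fetch(url) can be passed to BeautifulSoup in place of sourceCode, and calling fetch in a loop over a list of URLs covers several feeds.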