Removing stop words from a web page in Python

Date: 2014-02-28 04:31:55

Tags: python python-3.x ipython-notebook

I want to remove stop words from a web page. The following program works correctly with FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom', but when I change the URL it raises an error:

import os
import sys
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
                convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
#print "Fetched %s entries from '%s'" % (len(fp.entries[0])

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()
print ('Wrote output file to %s' % (f.name, ))
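The script stops once the cleaned post text is written to feed.json; the stop-word filtering itself is not shown. A minimal sketch of that step follows. It assumes a tiny hand-picked stop-word set and a simple regex tokenizer, and the names `STOP_WORDS` and `remove_stop_words` are hypothetical, not part of the original script:

```python
import re

# Hypothetical, deliberately tiny stop-word set, for illustration only.
STOP_WORDS = frozenset(["a", "an", "and", "the", "is", "in", "of", "to", "from"])

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased words of `text` that are not stop words, in order."""
    # Crude tokenizer: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words("The feed is parsed and the HTML is stripped"))
# prints ['feed', 'parsed', 'html', 'stripped']
```

With NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words('english')` could probably be passed as `stop_words` in place of the small set above.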

When I change the URL as follows, however, the script raises an error:

      FEED_URL = 'http://www.thehindu.com'

Error:

     IndexError                                Traceback (most recent call last)
     <ipython-input-1-b80b4061a360> in <module>()
     14 fp = feedparser.parse(FEED_URL)
     15 
     ---> 16 print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
     17 #print "Fetched %s entries from '%s'" % (len(fp.entries[0])
     18 

     IndexError: list index out of range

Can anyone help me solve this problem?

1 Answer:

Answer 0 (score: 0)

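The feed URL you are using does not look right. It appears to point at the site's HTML home page rather than an RSS or Atom feed, so feedparser finds no entries, fp.entries is an empty list, and fp.entries[0] raises the IndexError.

Try: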

FEED_URL = 'http://www.thehindu.com/?service=rss'

For other feeds, see: http://www.thehindu.com/navigation/?type=rss
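Whichever URL is used, checking that the parsed result contains entries before indexing into it turns the IndexError into a clearer message. A minimal sketch, assuming Python 3; the helper name `describe_feed` is hypothetical, and the `SimpleNamespace` objects are stand-ins for the value returned by `feedparser.parse` so that the example runs without network access:

```python
from types import SimpleNamespace

def describe_feed(parsed):
    """Summarize a parsed feed, or raise ValueError if it has no entries."""
    entries = getattr(parsed, "entries", [])
    if not entries:
        # An HTML page parsed as a feed typically ends up here.
        raise ValueError("no entries found; is the URL really an RSS/Atom feed?")
    title = parsed.feed.get("title", "(untitled feed)")
    return "Fetched %d entries from '%s'" % (len(entries), title)

# Stand-ins for feedparser.parse(FEED_URL) results (assumed shape: .entries, .feed).
good = SimpleNamespace(entries=["post 1", "post 2"], feed={"title": "Example Feed"})
empty = SimpleNamespace(entries=[], feed={})

print(describe_feed(good))
try:
    describe_feed(empty)
except ValueError as err:
    print("Error: %s" % err)
```

In the original script this would be called as `describe_feed(feedparser.parse(FEED_URL))`, replacing the print line that indexes fp.entries[0].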