Pulling top headlines from Reddit

Asked: 2015-01-19 20:50:32

Tags: python web-scraping

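What is the best way to get the top headlines from Reddit's front page? I am currently trying to scrape them with BeautifulSoup4. The Reddit API looks like a viable alternative, but I can't find anywhere in its documentation the URL to request for the top headlines. My guess was something like http://www.reddit.com/r/frontpage/top.json?limit=10, but that does not return any of the headlines shown on the front page.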

Python scraper method (not working):

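import urllib2
from bs4 import BeautifulSoup

def scrape(url):
    try:
        req = urllib2.Request(url)
        conn = urllib2.urlopen(req)
        content = conn.read()

        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(content, 'html.parser')

        # Print every anchor tag on the page
        for link in soup.find_all('a'):
            print link
    except urllib2.URLError as e:
        print 'Your HTTP error response code is: ', e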

Any suggestions?

1 answer:

Answer 0 (score: 3)

Following up on @jonrsharpe's comment, there is a Python Reddit API client, PRAW:

Use get_top() to get the top headlines:

>>> import praw
>>> r = praw.Reddit(user_agent='my_cool_application')
>>> for item in r.get_top():
...     print item
... 
4901 :: I made a Redundant Clock.
4764 :: Elon Musk plans to launch 4,000 satellites to deliver high-speed Inte...
5144 :: Pipeline breach spills up 50,000 gallons of oil into the Yellowstone ...
4603 :: Avalanche Dog In Training
4564 :: TIL it is illegal in many countries to perform surgical procedures on...
...

There are also get_top_from_day(), get_top_from_hour(), and so on.
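If you would rather not add a dependency, the site-wide top listing can also be read as JSON directly. The URL guessed in the question points at a subreddit named "frontpage"; the front-page listing lives at /top.json instead. Below is a minimal Python 3 sketch, not a definitive implementation. The helper names (build_top_url, extract_titles, fetch_top) and the USER_AGENT value are made up for illustration, and it assumes unauthenticated requests are still accepted, which Reddit may rate-limit or refuse.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical identifier; Reddit asks clients to send a descriptive User-Agent.
USER_AGENT = "my_cool_application"


def build_top_url(limit=10, period="day"):
    """Build the URL of the site-wide top listing in JSON form.

    `period` maps to the listing's `t` parameter
    (hour, day, week, month, year or all).
    """
    query = urlencode({"limit": limit, "t": period})
    return "https://www.reddit.com/top.json?" + query


def extract_titles(listing):
    """Return (score, title) pairs from a decoded listing payload."""
    children = listing["data"]["children"]
    return [(c["data"]["score"], c["data"]["title"]) for c in children]


def fetch_top(limit=10, period="day"):
    """Download the listing and return its (score, title) pairs."""
    req = urllib.request.Request(
        build_top_url(limit, period),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(req) as conn:
        return extract_titles(json.loads(conn.read().decode("utf-8")))


if __name__ == "__main__":
    # Needs network access; prints lines in the same "score :: title" shape
    for score, title in fetch_top():
        print(score, "::", title)
```

Changing period to "hour" or "day" plays the same role as get_top_from_hour() and get_top_from_day() in the client library.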