Trying to scrape a New York Times article, but scraping a slideshow instead

Time: 2014-04-13 20:45:41

Tags: python html request beautifulsoup screen-scraping

I have a strange problem with my code:

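    from bs4 import BeautifulSoup
    import requests

    def get_text(url):
        # Fetch the raw HTML of the page
        html = requests.get(url).content
        # Name the parser explicitly so results do not depend on which parsers are installed
        soup = BeautifulSoup(html, "html.parser")
        # Paragraphs carrying both classes hold the article body
        paragraphs = soup.select("p.story-body-text.story-content")
        text = ""
        for paragraph in paragraphs:
            text += paragraph.text
        # Drop non-ASCII characters and return text rather than bytes
        return text.encode("ascii", "ignore").decode("ascii")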

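Basically, my code is supposed to use "requests" to fetch the HTML, then use BS4 to find all "p.story-body-text.story-content" elements, which contain the actual article content. It works well on some articles, for example: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0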

http://www.nytimes.com/2014/04/13/world/asia/coalition-building-season-in-india.html

However, it does not work for these links:

http://www.nytimes.com/2014/04/06/world/middleeast/break-in-syrian-war-brings-brittle-calm.html?_r=0#

http://www.nytimes.com/2014/02/23/magazine/instagram-travel-diary.html?nav

I think the problem is the "requests" library, because it is not fetching the correct HTML. Any ideas? Edit: pastebin link http://pastebin.com/n3svnKTQ
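One way to test the guess that "requests" received different HTML is to look at what actually came back: the status code, the final URL after any redirects, and how many article-body paragraphs the response contains. The following is only a minimal sketch of that check, not a confirmed diagnosis. The names `StoryParagraphCounter`, `count_story_paragraphs`, and `describe_fetch` are hypothetical helpers invented for illustration, and the counting uses the standard-library `html.parser` instead of BS4 so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Class names taken from the selector used in the question.
WANTED_CLASSES = {"story-body-text", "story-content"}


class StoryParagraphCounter(HTMLParser):
    """Counts <p> tags that carry every class in WANTED_CLASSES."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        # A valueless class attribute is reported as None, hence the "or".
        classes = set((dict(attrs).get("class") or "").split())
        if WANTED_CLASSES <= classes:
            self.count += 1


def count_story_paragraphs(html):
    """Return how many article-body paragraphs appear in an HTML string."""
    counter = StoryParagraphCounter()
    counter.feed(html)
    return counter.count


def describe_fetch(url):
    """Fetch url and summarize what requests received (needs network access)."""
    import requests  # imported lazily so the helpers above work without it

    response = requests.get(url)
    return {
        "status": response.status_code,
        "final_url": response.url,
        # Each entry in response.history is an earlier redirect response.
        "redirects": [r.url for r in response.history],
        "story_paragraphs": count_story_paragraphs(response.text),
    }
```

Calling `describe_fetch` on a link that works and on one that fails, then comparing the two dictionaries, would show whether the failing pages end up at a different URL or simply contain no paragraphs with those two classes.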

0 answers:

No answers