Question

I have a strange problem with my code:

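    from bs4 import BeautifulSoup
    import requests

    def get_text(url):
        # Fetch the raw HTML of the page.
        html = requests.get(url).content
        # Name the parser explicitly so the result does not depend on
        # which optional parsers happen to be installed.
        soup = BeautifulSoup(html, "html.parser")
        # Paragraphs that should hold the article body.
        paragraphs = soup.select("p.story-body-text.story-content")
        text = ""
        for paragraph in paragraphs:
            text += paragraph.text
        # Drop non-ASCII characters and return a str rather than bytes.
        return text.encode("ascii", "ignore").decode("ascii")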

Basically, my code is supposed to use "requests" to fetch the HTML, then use BS4 to find all "p.story-body-text.story-content" elements, which contain the actual article text. It works well on some articles, for example: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0

and http://www.nytimes.com/2014/04/13/world/asia/coalition-building-season-in-india.html?

However, it does not work for these links:

http://www.nytimes.com/2014/04/06/world/middleeast/break-in-syrian-war-brings-brittle-calm.html?_r=0#

and

http://www.nytimes.com/2014/02/23/magazine/instagram-travel-diary.html?nav

I think the problem is with the "requests" library, because it does not seem to fetch the correct HTML. Any ideas? Edit: pastebin link http://pastebin.com/n3svnKTQ
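A minimal sketch for checking that suspicion, not a definitive diagnosis: it reports what "requests" actually received (status code, final URL, any redirects, body size) and how many elements match the selector from the code above. The helper names `count_story_paragraphs` and `summarize_fetch` and the 10-second timeout are assumptions made for illustration; they are not part of the original code.

```python
from bs4 import BeautifulSoup

# Selector taken from the code in the question.
SELECTOR = "p.story-body-text.story-content"


def count_story_paragraphs(html):
    """Return how many elements in the given HTML match SELECTOR."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(SELECTOR))


def summarize_fetch(url):
    """Fetch url and describe the response (hypothetical helper)."""
    import requests  # imported here so the parsing helper works offline

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    return {
        "status": response.status_code,
        "final_url": response.url,  # differs from url after a redirect
        "redirects": [r.status_code for r in response.history],
        "bytes": len(response.content),
        "matches": count_story_paragraphs(response.content),
    }
```

Calling `summarize_fetch` on a working link and on a failing one and comparing the two results may narrow things down. If `status` is 200 and `matches` is 0, the page was probably fetched fine but does not use that markup. A non-empty `redirects` list or an unexpected `final_url` would instead suggest that a different page was served.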

Trying to scrape a New York Times article, but scraping a slideshow instead

0 answers: