How can I use the newspaper library with websites that require authentication?

Time: 2018-06-11 14:27:51

Tags: python-3.x python-newspaper

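I am using the newspaper3k library to download the HTML of several articles from different news sites (this has worked well so far). However, because I need the full content, I have to authenticate (username, password) before requesting the HTML. I would appreciate any pointers in the right direction!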

I assume this has to happen before I call newspaper.build()?

(I should mention that this is my first time coding in Python, or writing any code at all, so any help would be great.)

import os    # used below for os.path.exists / os.makedirs
import time  # used below for time.sleep

import newspaper  # the newspaper3k library
from newspaper import news_pool

guardian = newspaper.build('https://www.theguardian.com/uk-news/all', language='en', memoize_articles=True)
telegraph = newspaper.build('https://www.telegraph.co.uk/news/uk/', language='en', memoize_articles=True)
dagbladet = newspaper.build('https://www.svd.se/sverige', language='sv', memoize_articles=True)
dagensnyheter = newspaper.build('https://www.dn.se/nyheter/sverige/', language='sv', memoize_articles=True)

allpapers = [guardian, telegraph, dagbladet, dagensnyheter]

for papers in allpapers:
    newpathpaper = r'/Users/articles/' + today + "/" + naming #naming is just a variable from further up that gives the name of each newspaper 
    if not os.path.exists(newpathpaper):
        os.makedirs(newpathpaper)

    #parsing, downloading and creating files for articles
    pointer = 0
    while(papers.size() > pointer):
        papers_article = papers.articles[pointer]
        papers_article.download()
        if papers_article.download_state == 2: #checking if article has been downloaded
            time.sleep(2)
            papers_article.parse()
            print(papers_article.url)

            #receiving publishing date so it is comparable
            published_today = papers_article.publish_date #newspaper extractor
            published = str(published_today)[0:10] 

            #writing html
            if published == today: #today was declared earlier
                # write the html into the directory created above (newpathpaper)
                filename = '%s_article_%s.html' % (naming, pointer)
                with open(os.path.join(newpathpaper, filename), 'w+') as f:
                    f.write(papers_article.html)
                print("written successfully")
                count_writes += 1
            else:
                print("not from today")

        else: 
            print("article %s" %pointer)
            print(papers_article.url)
            print("Has not downloaded!")
        pointer += 1
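
One possible direction, sketched under several assumptions: log in with a `requests.Session` so the login cookies are kept, download each article page with that session, and pass the resulting HTML to newspaper through `Article.download(input_html=...)` so that newspaper does not make its own unauthenticated request. The login URL, the form field names (`username`, `password`) and the function names below are hypothetical; each site has its own login form, and sites that log in via JavaScript or require a CSRF token will need more than a simple POST.

```python
def login(session, login_url, username, password,
          user_field="username", pass_field="password"):
    """POST the credentials; the session keeps any cookies the site sets.

    user_field / pass_field are assumed form field names; check the
    site's real login form and adjust them.
    """
    payload = {user_field: username, pass_field: password}
    response = session.post(login_url, data=payload, timeout=30)
    response.raise_for_status()
    return response


def fetch_html(session, url):
    """Download a page with the logged-in session and return its HTML."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_with_newspaper(url, html):
    """Give already-downloaded HTML to newspaper instead of letting it fetch."""
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)
    article.download(input_html=html)  # uses the supplied HTML, no HTTP request
    article.parse()
    return article


def fetch_article_behind_login(login_url, article_url, username, password):
    """Hypothetical end-to-end helper: log in, fetch, then parse."""
    import requests  # third-party: pip install requests

    with requests.Session() as session:
        login(session, login_url, username, password)
        html = fetch_html(session, article_url)
    return parse_with_newspaper(article_url, html)
```

With this approach, `newspaper.build()` could still be used without logging in to collect the article URLs, and only the per-article download step in the loop above would be replaced by `fetch_html` plus `parse_with_newspaper`. Whether the full text is actually returned depends on how the particular site implements its login.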

0 answers:

No answers