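How can I use the newspaper library with websites that require authentication?
I am using the newspaper3k library to download the HTML of several articles from different news sites (this has worked well so far). However, because I need the full content, I have to authenticate (username, password) before requesting the HTML. I would appreciate any pointers in the right direction!
I assume this has to happen before I call newspaper.build()?
(I should mention that this is my first time coding in Python, or writing any code at all, so any help would be great.)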
import os    # used below for os.path.exists / os.makedirs
import time  # used below for time.sleep

import newspaper  # import newspaper library
from newspaper import news_pool

guardian = newspaper.build('https://www.theguardian.com/uk-news/all', language='en', memoize_articles=True)
telegraph = newspaper.build('https://www.telegraph.co.uk/news/uk/', language='en', memoize_articles=True)
dagbladet = newspaper.build('https://www.svd.se/sverige', language='sv', memoize_articles=True)  # Svenska Dagbladet
dagensnyheter = newspaper.build('https://www.dn.se/nyheter/sverige/', language='sv', memoize_articles=True)
allpapers = [guardian, telegraph, dagbladet, dagensnyheter]
for papers in allpapers:
    # naming is just a variable from further up that gives the name of each newspaper
    newpathpaper = os.path.join('/Users/articles', today, naming)
    if not os.path.exists(newpathpaper):
        os.makedirs(newpathpaper)
    # parsing, downloading and creating files for articles
    pointer = 0
    while papers.size() > pointer:
        papers_article = papers.articles[pointer]
        papers_article.download()
        if papers_article.download_state == 2:  # 2 means the article has been downloaded
            time.sleep(2)
            papers_article.parse()
            print(papers_article.url)
            # receiving publishing date so it is comparable
            published_today = papers_article.publish_date  # newspaper extractor
            published = str(published_today)[0:10]
            # writing html
            if published == today:  # today was declared earlier
                filename = '%s_article_%s.html' % (naming, pointer)
                # write into the directory created above; "with" closes the file
                with open(os.path.join(newpathpaper, filename), 'w', encoding='utf-8') as f:
                    f.write(papers_article.html)
                print("written successfully")
                count_writes += 1
            else:
                print("not from today")
        else:
            print("article %s" % pointer)
            print(papers_article.url)
            print("Has not downloaded!")
        pointer += 1
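
For reference, here is a rough sketch of the direction I imagine, in case it helps frame an answer. It is not a confirmed solution. It assumes the site uses a plain HTML form login that sets a session cookie; real sites may need CSRF tokens or a JavaScript login, which this does not handle. LOGIN_URL, USER_FIELD, PASS_FIELD and the three helper functions are hypothetical names of my own. The idea is to log in with the standard library, fetch each article page through the logged-in session, and pass the HTML to newspaper through the input_html argument of Article.download() so that newspaper does not make its own unauthenticated request.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: replace with the site's real login URL and form field names.
LOGIN_URL = "https://example.com/login"
USER_FIELD = "username"
PASS_FIELD = "password"


def make_logged_in_opener(username, password, login_url=LOGIN_URL):
    """Log in once and return an opener whose cookie jar holds the session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    form = urllib.parse.urlencode({USER_FIELD: username, PASS_FIELD: password})
    # Passing data makes this a POST; cookies set by the response land in the jar.
    with opener.open(login_url, data=form.encode("utf-8"), timeout=10):
        pass
    return opener


def fetch_html(opener, url):
    """Download a page through the logged-in opener and return it as text."""
    with opener.open(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_with_newspaper(url, html):
    """Give already-downloaded HTML to newspaper instead of letting it fetch."""
    # Imported here so the two helpers above need only the standard library.
    from newspaper import Article

    article = Article(url)
    article.download(input_html=html)  # uses the given HTML, no HTTP request
    article.parse()
    return article
```

In the loop above this would mean calling make_logged_in_opener(...) once before the loop and replacing papers_article.download() with papers_article.download(input_html=fetch_html(opener, papers_article.url)). As far as I can tell, newspaper.build() would still read the section pages without the login, which may be acceptable if only the article pages are restricted.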