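I am scraping metadata from articles in JAMA. I have used slight variations of the same code to do the same task for several other health/medical journals, and was able to get the data I needed. For JAMA, however, I get the following error instead: "ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host".
From googling the error message, I suspect JAMA may be trying to prevent denial-of-service attacks. I don't think this is rate limiting, because I cannot retrieve any data from JAMA at all.
For reference, I've pasted my code below.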
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

json_data = []
jama2018 = requests.get('https://jamanetwork.com/journals/jama/issue/319/1')
soup1 = BeautifulSoup(jama2018.text, 'lxml')

# Get each issue
issue_links = [a.get('href') for a in soup1.find_all('a', {'class': re.compile('^issue-entry')})]
for issue_link in issue_links:
    readissue = requests.get(issue_link)
    soup2 = BeautifulSoup(readissue.text, 'lxml')

    # Get each article
    articlelinks = [a.get('href') for a in soup2.find_all('a', {'class': 'article--full-text'})]
    for a in articlelinks:
        jamadict = {"articletype": "NaN", "title": "NaN", "volume": "NaN", "issue": "NaN", "authors": "NaN", "url": "NaN"}
        openarticle = requests.get(a)
        soup3 = BeautifulSoup(openarticle.text, 'lxml')

        # Metadata for each article
        articletype = soup3.find("div", {"class": "meta-article-type thm-col"})
        title = soup3.find("meta", {"name": "citation_title"})
        volume = soup3.find("meta", {"name": "citation_volume"})
        issue = soup3.find("meta", {"name": "citation_issue"})
        authors = soup3.find("div", {"class": "meta-authors"})
        url = a

        # Overwrite the "NaN" placeholder only for fields found on the page
        if articletype is not None:
            jamadict['articletype'] = articletype.text.strip()
        if title is not None:
            jamadict['title'] = title['content'].strip()
        if volume is not None:
            jamadict['volume'] = volume['content'].strip()
        if issue is not None:
            jamadict['issue'] = issue['content'].strip()
        if authors is not None:
            jamadict['authors'] = authors.text.strip()
        if url is not None:
            jamadict['url'] = url

        # Collect the record so it ends up in the DataFrame
        json_data.append(jamadict)

df = pd.DataFrame(json_data)
df.to_csv('jama_2018.csv')
print("Saved")
Answer 0 (score: 0)
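You just need to send a User-Agent header. By default requests identifies itself as python-requests, and the server appears to drop such connections.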
import requests
from bs4 import BeautifulSoup as bs

# Send a browser-like User-Agent instead of the default python-requests one
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://jamanetwork.com/journals/jama/issue/319/1', headers=headers)
soup = bs(r.content, 'lxml')
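
Since the question's code issues many requests (issue pages, then every article), it may be convenient to set the header once on a requests.Session rather than passing it to each call. The following is only a sketch under assumptions: the helper names make_session and fetch, the one-second delay, and the 30-second timeout are illustrative choices, not part of the original answer, and it is not confirmed that the site accepts every browser-like User-Agent string.

```python
import time

import requests

# Assumption: 'Mozilla/5.0' is the value from the answer above; other
# browser-like strings may or may not be accepted by the site.
BROWSER_HEADERS = {'User-Agent': 'Mozilla/5.0'}


def make_session(headers=BROWSER_HEADERS):
    """Return a requests.Session that sends `headers` with every request."""
    session = requests.Session()
    session.headers.update(headers)  # overrides the default python-requests User-Agent
    return session


def fetch(session, url, delay=1.0, timeout=30):
    """GET `url` through the session after a short pause (hypothetical helper)."""
    time.sleep(delay)  # illustrative politeness delay, not required by the answer
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response
```

With this in place, each requests.get(...) call in the question's loops could become fetch(session, url), where session = make_session() is created once at the top of the script. Calling requests.utils.default_user_agent() shows the string requests would send without the override.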