I am trying to find the full-text PDFs of PubMed citations, given their PubMed IDs. To do this, I wrote the following functions:

    import re
    import requests

    def getFullText (attributes, pmid):
        doi = findDOi(attributes)
        if doi != None:
            URL = oaDOIpdfURL(doi)
            if URL[0] == 'True':
                downloadPDF(URL[1], pmid)
                return({'doi': doi, 'fullText': 'True'})
            else:
                # a DOI was found, but oaDOI returned no URL to download
                return({'doi': doi, 'fullText': 'False'})
        else:
            return({'fullText': 'False'})

    def findDOi (attributes):
        possibleDoi = []
        dictString = ', '.join("{!s}={!r}".format(key,val) for (key,val) in attributes.items())
        for each in dictString.split(' '):
            if re.match(r'(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)', each):
                possibleDoi.append(each.strip('.\','))
        # return the first candidate that resolves at doi.org
        for each in possibleDoi:
            if checkDOI(each) == 'True':
                return each

    def checkDOI (doi):
        response = requests.get('https://dx.doi.org/' + str(doi))
        if response.status_code == 404:
            return('False')
        else:
            return('True')

    def downloadPDF (url, id, direct='.../data/'):
        # saves whatever the URL returns, even if it is not a PDF
        response = requests.get(str(url))
        with open(direct + str(id) + '.pdf', 'wb') as f:
            f.write(response.content)

    def oaDOIpdfURL(doi):
        r = requests.get("https://api.unpaywall.org/v2/" + doi + "?email=myEmail").json()
        try:
            return('True', r['url_for_pdf']['url'])
        except (KeyError, TypeError):
            # fall back to the best open-access location, which may be a landing page
            try:
                return('True', r['best_oa_location']['url'])
            except (KeyError, TypeError):
                return('False', '')

However, in many cases oaDOI only returns the URL of the landing page, not the URL of the PDF.

For example, for PMID 29879703, oaDOI gives the following output:
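    {
      "best_oa_location": {
        "evidence": "open (via crossref license)",
        "host_type": "publisher",
        "is_best": true,
        "license": "cc-by-nc",
        "pmh_id": null,
        "updated": "2018-06-09T09:23:37.662562",
        "url": "https://doi.org/10.1159/000490704",
        "url_for_landing_page": "https://doi.org/10.1159/000490704",
        "url_for_pdf": null,
        "version": "publishedVersion"
      },
      ...

As a result, my script saves the HTML of the landing page as a PDF file named after the PMID (29879703.pdf here). What I need instead is the PDF that the landing page links to (i.e. here). How can I adjust this script so that it fetches the actual PDF, given that the landing pages belong to many different sites and each site exposes its PDF URL in a different way?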
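
To make the goal concrete, here is a rough sketch of the kind of fallback I have in mind. It is not a tested solution. It assumes that many publisher landing pages carry a `<meta name="citation_pdf_url">` tag (the convention used for Google Scholar indexing), or at least a plain link ending in `.pdf`. Sites that build their links with JavaScript would not be covered. The helper names `find_pdf_url`, `pick_unpaywall_pdf` and `looks_like_pdf` are made up for illustration, and the sketch only uses the standard library, so fetching the page would still go through `requests.get` as in my functions above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PdfLinkParser(HTMLParser):
    """Collects candidate PDF URLs from a landing page's HTML."""

    def __init__(self):
        super().__init__()
        self.meta_pdf = None     # from <meta name="citation_pdf_url" content="...">
        self.anchor_pdfs = []    # from <a href="...pdf">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "citation_pdf_url":
            self.meta_pdf = self.meta_pdf or attrs.get("content")
        elif tag == "a":
            href = attrs.get("href") or ""
            # ignore the query string when checking the extension
            if href.lower().split("?")[0].endswith(".pdf"):
                self.anchor_pdfs.append(href)


def find_pdf_url(html, base_url):
    """Return an absolute PDF URL found in the page, or None."""
    parser = _PdfLinkParser()
    parser.feed(html)
    candidate = parser.meta_pdf
    if not candidate and parser.anchor_pdfs:
        candidate = parser.anchor_pdfs[0]
    return urljoin(base_url, candidate) if candidate else None


def pick_unpaywall_pdf(record):
    """Return the first non-null url_for_pdf among the OA locations, else None."""
    locations = [record.get("best_oa_location")]
    locations += list(record.get("oa_locations") or [])
    for loc in locations:
        if loc and loc.get("url_for_pdf"):
            return loc["url_for_pdf"]
    return None


def looks_like_pdf(content):
    """PDF files start with the bytes %PDF-, HTML pages do not."""
    return content.startswith(b"%PDF-")
```

The idea would be to try `pick_unpaywall_pdf` on the oaDOI response first, fall back to `find_pdf_url` on the landing page's HTML, and only write the file when `looks_like_pdf(response.content)` is true, so that an HTML page is never saved with a `.pdf` extension.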