Finding the PDF of a PubMed article

Time: 2018-06-09 09:40:52

Tags: python web-scraping

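I am trying to find full-text PDFs for PubMed citations, given their PubMed IDs. To do this, I wrote the following functions:

import re
import requests

def getFullText(attributes, pmid):
    # Look up a DOI in the citation attributes, then try to download the PDF.
    doi = findDOi(attributes)
    if doi is not None:
        URL = oaDOIpdfURL(doi)
        if URL[0] == 'True':
            downloadPDF(URL[1], pmid)
            return {'doi': doi, 'fullText': 'True'}
        else:
            # No open-access URL was found, so nothing was downloaded.
            return {'doi': doi, 'fullText': 'False'}
    else:
        return {'fullText': 'False'}

def findDOi(attributes):
    # Flatten the attributes into one string and collect DOI-like tokens.
    possibleDoi = []
    dictString = ', '.join("{!s}={!r}".format(key, val) for (key, val) in attributes.items())
    for each in dictString.split(' '):
        if re.match(r'(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)', each):
            possibleDoi.append(each.strip('.\','))
    # Return the first candidate that resolves; implicitly None if none does.
    for each in possibleDoi:
        if checkDOI(each) == 'True':
            return each

def checkDOI(doi):
    # A DOI counts as valid unless the resolver answers 404.
    response = requests.get('https://dx.doi.org/' + str(doi))
    if response.status_code == 404:
        return 'False'
    else:
        return 'True'

def downloadPDF(url, id, direct='.../data/'):
    # Save whatever the URL returns as <id>.pdf, without checking its type.
    response = requests.get(str(url))
    with open(direct + str(id) + '.pdf', 'wb') as f:
        f.write(response.content)

def oaDOIpdfURL(doi):
    r = requests.get("https://api.unpaywall.org/v2/" + doi + "?email=myEmail").json()
    try:
        return ('True', r['url_for_pdf']['url'])
    except (KeyError, TypeError):
        try:
            # Fallback: this URL is often the landing page rather than a PDF.
            return ('True', r['best_oa_location']['url'])
        except (KeyError, TypeError):
            return ('False', '')

However, in many cases oaDOI returns only the URL of the landing page, not the URL of the PDF.

For example, for PMID 29879703, oaDOI returns the following output:

{
  "best_oa_location": {
    "evidence": "open (via crossref license)",
    "host_type": "publisher",
    "is_best": true,
    "license": "cc-by-nc",
    "pmh_id": null,
    "updated": "2018-06-09T09:23:37.662562",
    "url": "https://doi.org/10.1159/000490704",
    "url_for_landing_page": "https://doi.org/10.1159/000490704",
    "url_for_pdf": null,
    "version": "publishedVersion"
  },
  ...

As a result, my script saves the landing page's HTML as a PDF file named after the PMID. However, I need to download the PDF that the landing page links to (i.e. here). How can I adapt this script so that it fetches the PDF, given that it will be pointed at many different sites, each of which exposes its PDF URL in a different way?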
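To make the problem concrete, here is a minimal sketch of one possible direction, not a complete solution. It assumes the landing page carries a `citation_pdf_url` meta tag, a convention many publishers follow for scholarly indexing but which is by no means guaranteed on every site. The names `PdfMetaParser`, `find_pdf_url`, and `looks_like_pdf` are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfMetaParser(HTMLParser):
    """Collects the content of <meta name="citation_pdf_url"> tags."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attrs = dict(attrs)
        # Attribute values can be None, so guard before lowercasing.
        name = (attrs.get('name') or '').lower()
        if name == 'citation_pdf_url' and attrs.get('content'):
            self.pdf_urls.append(attrs['content'])


def find_pdf_url(html, base_url):
    """Return the first advertised PDF URL as an absolute URL, or None.

    Assumption: the page uses the citation_pdf_url meta tag; pages that
    do not will simply yield None.
    """
    parser = PdfMetaParser()
    parser.feed(html)
    if parser.pdf_urls:
        return urljoin(base_url, parser.pdf_urls[0])
    return None


def looks_like_pdf(content):
    """Check the leading magic bytes that PDF files start with."""
    return content.startswith(b'%PDF-')
```

With helpers like these, `downloadPDF` could test `looks_like_pdf(response.content)` before writing the file and, when the check fails, try `find_pdf_url(response.text, response.url)` and request that URL instead. Sites that do not expose the meta tag would still need their own handling.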

0 answers:

No answers