How to download many PDF files via the World Bank API using Python

Date: 2019-06-05 02:32:51

Tags: python api pdf web-scraping

I am trying to use Python to download many PDF files (a few hundred) from the World Bank documents archive. The API URL can be customized with any combination of terms (for example, country, or sectors such as education and health).
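For reference, the same query can also be built by passing the search terms through requests' params argument instead of hand-assembling the URL string; below is a minimal sketch that only uses the parameters already present in the URL in my code.

import requests

# Build the same query with requests' params argument
# (parameter names taken from the URL used below)
base_url = "http://search.worldbank.org/api/v2/wds"
params = {
    "format": "json",
    "countcode": "VN",                 # country code for Vietnam
    "majdocty_exact": "Publications",  # major document type
    "teratopic_exact": "Education",    # topic filter
    "srt": "docdt",                    # sort by document date
    "order": "desc",
}
response = requests.get(base_url, params=params)
print(response.status_code)
print(list(response.json().keys()))    # inspect the top-level JSON structure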

I tried the code below with the URL above to download the files for Vietnam's education sector. The URL lists the operational documents, with all of their PDF links, that match the specified terms. However, no files were downloaded.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://search.worldbank.org/api/v2/wds?format=json&countcode=VN&majdocty_exact=Publications&teratopic_exact=Education&srt=docdt&order=desc"

#Folder to download the files
folder_location = r'J:\New Volume (B)\pdfs'

response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")     
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files 
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

After running the code I don't get any errors, but no files are downloaded either. Any help would be greatly appreciated. Thank you.

1 Answer:

Answer 0 (score: 0)

Use response.json(); bs4 is not needed.

import os
import requests

url = "http://search.worldbank.org/api/v2/wds?format=json&countcode=VN&majdocty_exact=Publications&teratopic_exact=Education&srt=docdt&order=desc"

# Folder to download the files
folder_location = r'J:\New Volume (B)\pdfs'

# The API already returns JSON, so parse it directly
response = requests.get(url).json()
for doc_id in response['documents']:
    pdf_url = response['documents'][doc_id].get('pdfurl')
    if pdf_url:
        # Name each file after the last part of its PDF URL
        filename = os.path.join(folder_location, pdf_url.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(pdf_url).content)
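Note that a single call like this typically returns only the first page of results, so for a few hundred documents you will probably need to page through the API. The rows (page size) and os (offset) parameters below are assumptions based on the public wds API; treat this as a sketch rather than a tested solution.

import os
import requests

folder_location = r'J:\New Volume (B)\pdfs'
os.makedirs(folder_location, exist_ok=True)

base_url = "http://search.worldbank.org/api/v2/wds"
params = {
    "format": "json",
    "countcode": "VN",
    "majdocty_exact": "Publications",
    "teratopic_exact": "Education",
    "srt": "docdt",
    "order": "desc",
}

page_size = 100                              # assumed 'rows' page-size parameter
for offset in range(0, 1000, page_size):     # hard cap of 1000 docs as a safeguard
    params["rows"] = page_size
    params["os"] = offset                    # assumed offset parameter
    data = requests.get(base_url, params=params).json()
    documents = data.get('documents', {})
    # a 'facets' entry may appear alongside the documents; skip it defensively
    doc_entries = {k: v for k, v in documents.items() if k != 'facets'}
    if not doc_entries:
        break                                # no more results
    for doc in doc_entries.values():
        pdf_url = doc.get('pdfurl')          # not every document has a PDF
        if not pdf_url:
            continue
        filename = os.path.join(folder_location, pdf_url.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(pdf_url).content)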