Downloading PDFs from URLs in a list with Python 3.7

Date: 2019-01-14 08:23:24

Tags: python-3.x pdf download

I have a Python script that uses Selenium to scrape URLs from a website and store them in a list. I then want to download them using the wget module.

This is the relevant part of the code, where the script completes the partial URLs obtained from the website:

import wget  # pip install wget

new_links = []
for link in list_of_links:  # trim the scraped onclick strings into real URLs
    current_strings = link.split("/consultas/coleccion/window.open('")
    current_strings[1] = current_strings[1].split("');return")[0]
    new_link = current_strings[0] + current_strings[1]
    new_links.append(new_link)

for new_link in new_links:
    wget.download(new_link)
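As a sanity check, the trimming step above can be exercised on a hand-made sample. The onclick string below is a hypothetical reconstruction of what the scraper stores (page URL with the raw onclick JavaScript appended), not data captured from the real site:

```python
# Hypothetical sample of one entry in list_of_links after scraping.
sample = ("http://digesto.asamblea.gob.ni/consultas/coleccion/"
          "window.open('/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D');return false")

# Same splitting logic as the loop above.
parts = sample.split("/consultas/coleccion/window.open('")
partial = parts[1].split("');return")[0]
full_url = parts[0] + partial
print(full_url)
# http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
```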

At the moment the script does nothing: it never downloads any PDFs, and it raises no error message.

What am I doing wrong in the second for loop?

Edit:

Regarding the question of whether new_links is empty: it is not.

print(*new_links, sep = '\n')

gives me the following links (just four of them here):

http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=dsyx6l1Fbig%3D
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Cb64W7EHlD8%3D
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=A4TKEG9x4F8%3D

Edit 2:

A partial URL looks like /consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D

The "base URL" http://digesto.asamblea.gob.ni is then prepended to it.
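As a side note, this prepending step can also be done with the standard library's urljoin, which resolves a root-relative path against a base URL; a minimal sketch using the sample partial URL above:

```python
from urllib.parse import urljoin

base = "http://digesto.asamblea.gob.ni"
partial = "/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D"

# urljoin resolves the root-relative path against the base host.
print(urljoin(base, partial))
# http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
```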

This is the relevant part of the code that collects the partial URLs; it runs before the code above:

from selenium.webdriver.common.by import By

list_of_links = []    # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url   # for any links not starting with /
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements(By.CSS_SELECTOR, "tbody tr")  # get all table rows
for row in rows:
    row.find_element(By.CSS_SELECTOR, 'button').click()  # open the row's menu
    link = row.find_element(By.CSS_SELECTOR, 'li a[onclick*=pdf]').get_attribute("onclick")  # get partial link
    if link.startswith('/'):
        list_of_links.append(tld + link)  # add base to partial link
    else:
        list_of_links.append(current_url + link)
    row.find_element(By.CSS_SELECTOR, 'button').click()  # close the menu again

1 answer:

Answer 0 (score: 1)

Your loop is fine. Try upgrading your wget module to version 3.2 and run it again:

import wget  # pip install wget==3.2

new_links = ['http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D',
'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=dsyx6l1Fbig%3D',
'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Cb64W7EHlD8%3D',
'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=A4TKEG9x4F8%3D']

for new_link in new_links:
    wget.download(new_link)

Output: four files were downloaded, named pdf.php, pdf (1).php, etc.
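Since every URL ends in pdf.php, wget falls back to that name for every file. A sketch of deriving a distinct filename from each URL's rdd token instead, using only the standard library for the name derivation (the out= argument is wget 3.2's way of setting the output filename; the sanitizing rules below are my own assumption about what makes a safe filename):

```python
from urllib.parse import urlparse, parse_qs

def pdf_filename(url):
    """Derive a filesystem-friendly name from the rdd token in the URL."""
    token = parse_qs(urlparse(url).query)["rdd"][0]  # parse_qs already percent-decodes
    return token.replace("/", "_").replace("+", "-").rstrip("=") + ".pdf"

url = "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D"
print(pdf_filename(url))  # vPjrUnz0wbA.pdf

# Then, in the download loop:
# for new_link in new_links:
#     wget.download(new_link, out=pdf_filename(new_link))
```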