
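I am trying to crawl a web page and find and download all the PDFs it links to. From another question I found a solution that uses lxml to find the links ending in .pdf (it is much faster than my own code, which used mechanize), but I don't know how to use it to save those files to a folder. Can urlretrieve be used together with lxml? If so, how?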

My code:

import lxml.html
import urllib2
import urlparse
from urllib import urlretrieve

base_url = 'http://www.example.html'
# Raw string, so that "\t" in "\test_pdfs" is not read as a tab character
folder = r"C:\Users\Meelah\Desktop\test_pdfs"

response = urllib2.urlopen(base_url)

tree = lxml.html.fromstring(response.read())

# EXSLT regular-expression namespace, needed for re:test() in the XPath
ns = {'re': 'http://exslt.org/regular-expressions'}

# Match every <a> whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    print urlparse.urljoin(base_url, node.attrib['href'])
    # code here to save it
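
One possible way to fill in the save step, offered as a rough sketch rather than a tested answer: urlretrieve does not need to know anything about lxml, since it only takes a URL string and a destination path. lxml finds the links, and each absolute URL can then be handed to urlretrieve. The helper names local_path_for and download_pdfs are made up for illustration, and the try/except import block is an assumption added so the same sketch runs on both Python 2 (as in the question) and Python 3.

```python
import os
import posixpath

try:  # Python 3
    from urllib.parse import urljoin, urlsplit, unquote
    from urllib.request import urlopen, urlretrieve
except ImportError:  # Python 2, as used in the question
    from urlparse import urljoin, urlsplit
    from urllib import unquote, urlretrieve
    from urllib2 import urlopen


def local_path_for(pdf_url, folder):
    """Hypothetical helper: build the destination path from the last
    segment of the URL path, ignoring any query string."""
    name = posixpath.basename(unquote(urlsplit(pdf_url).path))
    return os.path.join(folder, name)


def download_pdfs(base_url, folder):
    """Hypothetical helper: fetch base_url, find links ending in .pdf,
    and save each linked file into folder."""
    # Third-party package; imported here so local_path_for works without it.
    import lxml.html

    if not os.path.isdir(folder):
        os.makedirs(folder)

    tree = lxml.html.fromstring(urlopen(base_url).read())
    ns = {'re': 'http://exslt.org/regular-expressions'}

    for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
        pdf_url = urljoin(base_url, node.attrib['href'])
        target = local_path_for(pdf_url, folder)
        urlretrieve(pdf_url, target)  # downloads the URL straight to the file
        print('saved %s -> %s' % (pdf_url, target))
```

It would be called as download_pdfs(base_url, folder) with the two variables from the question. Note that two links sharing the same file name would overwrite each other in this sketch, and there is no error handling for links that fail to download.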

How to download and save files in a loop using Python and lxml

0 answers: