Question

The task is simple: use Python to download all the PDFs from the following URL:

https://www.electroimpact.com/Company/Patents.aspx

I am only a beginner in Python. I read about Python crawlers, but the samples handle html pages rather than aspx ones. All I get are blank file downloads.

Here is my code:

{{1}}

My regular expression finds only 4 PDFs, but the page links to many more. Why?

Answer 1

Use lxml.html and cssselect instead of re, and you will get all the relevant patent document paths:

#!/usr/bin/env python
# coding: utf8
from __future__ import absolute_import, division, print_function
import urllib2
from lxml import html


def main():
    url = 'https://www.electroimpact.com/Company/Patents.aspx'
    source = urllib2.urlopen(url).read()
    document = html.fromstring(source)
    patent_paths = [
        a.attrib['href'] for a in document.cssselect('div.PatentNumber a')
    ]
    print(patent_paths)


if __name__ == '__main__':
    main()
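
The script above only prints the paths. As a rough sketch of the remaining step (turning each href into an absolute URL and saving the file), something like the following might work. It makes several assumptions that are not part of the answer: it targets Python 3 (urllib.request rather than urllib2), it swaps lxml for the standard library's html.parser so that nothing extra needs installing, it picks links by a ".pdf" suffix rather than by the div.PatentNumber selector, and the names LinkCollector, extract_pdf_urls, download_all and the 'patents' directory are made up for illustration.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over lower-cased tag and attribute names.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def extract_pdf_urls(page_html, base_url):
    """Return absolute, de-duplicated URLs of links whose path ends in .pdf."""
    collector = LinkCollector()
    collector.feed(page_html)
    urls = []
    for href in collector.hrefs:
        # Relative hrefs are resolved against the page they came from.
        absolute = urljoin(base_url, href)
        is_pdf = urlsplit(absolute).path.lower().endswith('.pdf')
        if is_pdf and absolute not in urls:
            urls.append(absolute)
    return urls


def download_all(urls, target_dir='patents'):
    """Save each URL into target_dir, named after its last path segment."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.basename(urlsplit(url).path)
        # PDFs are binary, so the file is opened in 'wb' mode.
        with urlopen(url) as response, \
                open(os.path.join(target_dir, filename), 'wb') as out:
            out.write(response.read())


def main():
    page_url = 'https://www.electroimpact.com/Company/Patents.aspx'
    with urlopen(page_url) as response:
        page = response.read().decode('utf-8', errors='replace')
    download_all(extract_pdf_urls(page, page_url))
```

Calling main() would fetch the page and write the files into a local patents directory. Whether the suffix test catches exactly the patent documents depends on how the page is marked up, so the selector from the answer may still be the more precise filter.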

How do I download all PDFs from an aspx web page?
