Regular expression to find exact PDF links in a web page

Date: 2017-02-27 05:56:08

Tags: regex python-3.x web-scraping

Given url = 'http://normanpd.normanok.gov/content/daily-activity', the site posts three types of PDF summaries: arrests, incidents, and cases. I have been asked to use a regular expression in Python to find the URL strings of all the incident PDF documents.

The PDFs will then be downloaded to a defined location.

I browsed the page and found that the incident PDF URLs follow this format:

normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf

I wrote this code:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"

response = urllib.request.urlopen(url)

data = response.read()      # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)

But the resulting URL list is empty. I am a beginner with Python 3 and regex. Can anyone help me?
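
For reference, the pattern can be tested against a sample href in the format shown above: the page encodes spaces as %20, which \s does not match, and $ anchors the match to the end of a line, so findall returns nothing. A minimal check (the sample string is assumed for illustration):

import re

sample = '/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf'
# \s matches literal whitespace, but the spaces in the href are encoded
# as %20, so \sIncident\s can never match anywhere in this string
print(re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', sample))  # -> []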

2 Answers:

Answer 0 (score: 0):

This is not a sensible approach. Instead, use an HTML parsing library such as bs4 (BeautifulSoup) to find the links, and use a regular expression only to filter the results.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
# keep only anchors whose href contains "Incident%20Summary.pdf"
links = soup.find_all('a', href=re.compile(r'Incident%20Summary\.pdf'))

for el in links:
    print("http://normanpd.normanok.gov" + el['href'])

Output:

http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf

But if you are required to use only regular expressions, try something simpler:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
text = data.decode('utf-8')
# non-greedy match from "filebrowser_download" through the ".pdf" suffix
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)

Answer 1 (score: 0):

Here is a simple way to do it using BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = 'http://normanpd.normanok.gov'
open_page = urlopen(base_url + '/content/daily-activity').read()
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    # skip anchors without an href; keep only incident PDFs
    if current and current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(base_url, current))
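
The collected list can then be printed or handed to a download step; a quick usage check:

for link in links:
    print(link)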