Regular expression to find exact PDF links in a web page

Date: 2017-02-27 05:56:08

Tags: regex python-3.x web-scraping

Given url = 'http://normanpd.normanok.gov/content/daily-activity', the site posts three types of PDF summaries: arrests, incidents, and cases. I have been asked to use a regular expression in Python to find the URL strings of all the incident PDF documents.

The PDFs will then be downloaded to a defined location.

I browsed the page and found that the incident PDF URLs follow this format:

normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf

I wrote this code:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"

response = urllib.request.urlopen(url)

data = response.read()      # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)

But the resulting URL list is empty. I am a beginner with Python 3 and regex. Can anyone help me?
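
For reference, the pattern can be tested against a sample href in the format shown above: the page encodes spaces as %20, which \s does not match, and $ anchors the match to the end of a line, so findall returns nothing. A minimal check (the sample string is assumed for illustration):

import re

sample = '/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf'
# \s matches literal whitespace, but the spaces in the href are encoded
# as %20, so \sIncident\s can never match anywhere in this string
print(re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', sample))  # -> []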

2 Answers:

Answer 0 (score: 0):

This is not a sensible approach. Instead, use an HTML parsing library such as bs4 (BeautifulSoup) to find the links, and use a regular expression only to filter the results.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
# keep only anchors whose href contains "Incident%20Summary.pdf"
links = soup.find_all('a', href=re.compile(r'Incident%20Summary\.pdf'))

for el in links:
    print("http://normanpd.normanok.gov" + el['href'])

Output:

http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf

But if you are required to use only regular expressions, try something simpler:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
text = data.decode('utf-8')
# non-greedy match from "filebrowser_download" through the ".pdf" suffix
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)

Answer 1 (score: 0):

Here is a simple way to do it using BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = 'http://normanpd.normanok.gov'
open_page = urlopen(base_url + '/content/daily-activity').read()
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    # skip anchors without an href; keep only incident PDFs
    if current and current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(base_url, current))
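
The collected list can then be printed or handed to a download step; a quick usage check:

for link in links:
    print(link)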