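What do I need to pass as the regular-expression pattern to pick up the html and pdf links from an HTML web page? Here is what I have so far. I assumed I needed an OR statement, but it does not work as expected.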
status, response = http.request("http://www.example.com")
htmlandpdfonly = SoupStrainer('a', href=re.compile('html|pdf'))
for link in BeautifulSoup(response, parseOnlyThese=htmlandpdfonly):
    if link.has_key('href'):
        print link['href']
Answer 0 (score: 3)
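import re
import httplib2
from BeautifulSoup import BeautifulSoup

# match ".html" or ".pdf" anywhere in a string
match = re.compile(r'\.(html|pdf)')

# fetch and parse the page content
# (assumption: http is an httplib2.Http() instance, as the question's call suggests)
http = httplib2.Http()
status, response = http.request("http://www.example.com")
page = BeautifulSoup(response)

# print every link whose href matches the pattern
for link in page.findAll('a'):
    try:
        href = link['href']
        if match.search(href):
            print href
    except KeyError:
        # this <a> tag has no href attribute
        pass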
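
For comparison, here is a rough sketch of the same filtering idea using only the Python 3 standard library rather than the Python 2 / BeautifulSoup 3 code above. The names EXT_PATTERN, LinkCollector, extract_links and SAMPLE are made up for illustration, and the pattern is slightly stricter than the one in the answer: it is assumed you only want hrefs whose path ends in .html or .pdf (optionally followed by a query string or fragment), so a bare 'html|pdf' match on something like "htmlguide/index.php" is rejected.

```python
import re
from html.parser import HTMLParser

# ".html" or ".pdf" at the end of the href, or just before "?" / "#".
# Case-insensitive so that "REPORT.PDF" is accepted as well.
EXT_PATTERN = re.compile(r'\.(html|pdf)(?=$|[?#])', re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match EXT_PATTERN."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; href may be absent
        href = dict(attrs).get('href')
        if href and EXT_PATTERN.search(href):
            self.links.append(href)


def extract_links(html_text):
    """Return matching hrefs in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Hypothetical sample page, standing in for the downloaded response body.
SAMPLE = '''
<a href="index.html">home</a>
<a href="docs/REPORT.PDF">report</a>
<a href="paper.pdf?download=1">paper</a>
<a href="htmlguide/index.php">guide</a>
<a name="top">no href here</a>
'''

for href in extract_links(SAMPLE):
    print(href)
```

In practice the string passed to extract_links would be the decoded body of the HTTP response instead of SAMPLE. Looking the href up with dict.get plays the role of the try/except KeyError in the answer: anchors without an href are simply skipped.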