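I'm working on an assignment that asks us to write a program that crawls a given static corpus. The output of my code lists every URL it crawled, and I know some of them are traps, but I can't think of a Pythonic way to filter those URLs out.
I used regular expressions to filter out the trap-like URLs, but that isn't allowed in this homework because it counts as hard-coding. Some examples: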
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=4d26fc0839d47d4ec13c5461c1ed6d96
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d504a3676483838e82f07064ca3e12ee
There are others with a similar structure. There are also calendar URLs with a similar structure, where only the date changes:
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=22&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=25&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=26&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=27&month=01&year=2017
I want to filter these out of the results, but I can't think of any way to do it.
Answer 0 (score: 0)
I think this could solve your problem:
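    import requests

    # Placeholder: replace with the list of URLs your crawler collected.
    urls = []

    for url in urls:
        try:
            # The timeout keeps one slow URL from stalling the whole loop.
            response = requests.get(url, timeout=10)
            # Raises requests.HTTPError for 4xx/5xx responses;
            # no exception is raised if the response was successful.
            response.raise_for_status()
        except requests.RequestException as err:
            print(f'Error occurred for {url}: {err}')
        else:
            print(f'Url is valid: {url}')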
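
Note that this check only tells you whether a URL responds without an HTTP error. The login and calendar URLs from the question would most likely pass it. If the goal is to spot traps without hard-coding site-specific patterns, one possible approach is to group URLs by a structural "signature" (host, path, and the set of query-parameter names, ignoring the scheme and the parameter values) and keep only a few URLs per signature. The following is a rough sketch under that assumption, not a definitive solution: `url_signature`, `filter_traps`, and the `max_per_signature` threshold are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_signature(url):
    """Return (host, path, sorted query-parameter names).

    The scheme and the parameter values are ignored on purpose, so URLs
    that differ only in a token or a date share the same signature.
    """
    parts = urlsplit(url)
    keys = {key for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return (parts.netloc.lower(), parts.path, tuple(sorted(keys)))


def filter_traps(urls, max_per_signature=2):
    """Keep at most max_per_signature URLs per signature.

    Returns a (kept, dropped) pair of lists, preserving input order.
    The threshold is an arbitrary assumption and should be tuned.
    """
    seen = defaultdict(int)
    kept, dropped = [], []
    for url in urls:
        signature = url_signature(url)
        seen[signature] += 1
        if seen[signature] <= max_per_signature:
            kept.append(url)
        else:
            dropped.append(url)
    return kept, dropped
```

Calling `filter_traps(urls)` on the crawled list returns the URLs to keep and the suspected traps separately, so the dropped ones can be inspected. The trade-off is that legitimate pages sharing one path and parameter set (for example, real pagination) would be cut off at the threshold too, so the limit is something to tune against the corpus rather than a fixed rule.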