I've been writing a web scraper in Python to pull email addresses and phone numbers from a list of URLs. The code runs and outputs data correctly, but I've hit a problem where some of the sites are dead, and the code hangs while trying to scrape one of those URLs. Is there a way to limit how long the script spends trying each URL?
import csv
import urllib, re
from itertools import islice

with open('***', mode='r') as csv_file:
    reader = csv.DictReader(csv_file)
    with open('***', mode='w') as csv_file:
        fieldnames = ['url', 'phone', 'email']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # print(row)
            print(row.values()[0])
            f = urllib.urlopen(row.values()[0])
            s = f.read().decode('utf-8')
            # print(row.values())
            print({'url': row.values(), 'phone': re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", s),
                   'email': re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)})
            writer.writerow({'url': row.values(), 'phone': re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", s),
                             'email': re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)})
The error I get after killing the program is:
Traceback (most recent call last):
File "***", line 17, in <module>
s = f.read().decode('utf-8')
File "***", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 26045: invalid start byte
Any help with limiting the time the scraper spends per URL, or with handling failed requests, would be amazing!
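For context, I'm imagining something like the following sketch (assuming Python 3's urllib.request, since its urlopen accepts a timeout argument; fetch_text is just a hypothetical helper name I made up, and decoding with errors='replace' is my guess at sidestepping the UnicodeDecodeError too):

```python
import socket
import urllib.request
from urllib.error import URLError

def fetch_text(url, timeout=10):
    """Fetch a URL's body as text, or return None on timeout/failure.

    timeout is the maximum number of seconds to wait for the connection
    and for blocking socket reads, so a dead site can't hang the loop.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # errors='replace' keeps non-UTF-8 bytes (like 0xa9) from
            # raising UnicodeDecodeError; bad bytes become U+FFFD.
            return resp.read().decode('utf-8', errors='replace')
    except (URLError, socket.timeout, ValueError):
        # URLError covers DNS failures and connection errors; a timeout
        # surfaces as URLError or socket.timeout depending on the phase.
        return None
```

The idea is that the main loop would call fetch_text per row, skip the row when it returns None, and keep going instead of hanging.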