Web crawler stops mid-scrape

Time: 2019-10-10 16:57:16

Tags: python web-crawler

I have been writing a web crawler in Python to pull a list of emails and phone numbers from a list of URLs. The code runs and outputs data correctly, but I have hit a problem where some of the sites are dead and the code gets stuck trying to scrape a single URL. Is there a way to limit how long it spends trying each URL?

import csv
import urllib, re

# Read URLs from the first column of the input CSV (Python 2 style I/O).
with open('***', mode='r') as in_file:
    reader = csv.DictReader(in_file)

    with open('***', mode='w') as out_file:
        fieldnames = ['url', 'phone', 'email']
        writer = csv.DictWriter(out_file, fieldnames=fieldnames)
        writer.writeheader()

        for row in reader:
            url = row.values()[0]  # first cell of the row is the URL
            print(url)

            f = urllib.urlopen(url)
            s = f.read().decode('utf-8')

            # Build the record once instead of running each regex twice.
            record = {'url': url,
                      'phone': re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", s),
                      'email': re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)}
            print(record)
            writer.writerow(record)

The error I get after stopping the program is:

Traceback (most recent call last):
  File "***", line 17, in <module>
    s = f.read().decode('utf-8')
  File "***", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 26045: invalid start byte
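
For what it's worth, this UnicodeDecodeError is separate from the hanging problem: the fetched page simply is not valid UTF-8 (byte 0xa9 is the copyright sign in Latin-1). A minimal sketch of a fallback decode, under the same Python 2 assumptions as the code above:

raw = f.read()
try:
    s = raw.decode('utf-8')
except UnicodeDecodeError:
    # Latin-1 maps every byte to a character, so this never raises;
    # a few characters may come out wrong, but the regex scans still work.
    s = raw.decode('latin-1')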

Any help with putting a time limit on this crawler, or with failing the request, would be amazing!
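
For illustration, a minimal sketch of a per-URL time limit, assuming Python 2 as in the code above (urllib.urlopen takes no timeout argument, but urllib2.urlopen does). The fetch helper name here is made up for the sketch:

import socket
import urllib2

socket.setdefaulttimeout(10)  # fallback for sockets opened without an explicit timeout

def fetch(url, timeout=10):
    # Return the page body, or None if the host is dead, unreachable,
    # or slower than `timeout` seconds, instead of hanging the crawl.
    try:
        return urllib2.urlopen(url, timeout=timeout).read()
    except (urllib2.URLError, socket.timeout) as e:
        print('skipping %s: %s' % (url, e))
        return None

Rows whose fetch(url) comes back as None could then be skipped with continue in the main loop, so one dead site no longer stalls the whole run.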

0 Answers:

No answers