Scrapy crawler stops in the middle of a crawl

Date: 2014-10-09 17:57:45

Tags: python web-scraping scrapy web-crawler screen-scraping

I am having a problem with a Scrapy spider. It crawls a list of URLs read from a txt file, and partway through the scrape the spider keeps shutting down. Here is the general spider logic:

from scrapy.spider import BaseSpider
# DomainItem is the project's Item subclass, imported from the project's items module

class DomainArticlesSpider(BaseSpider):
    name = "domain_articles"
    allowed_domains = ["domain.com"]

    # start_urls comes from a plain-text file with one URL per line;
    # Scrapy iterates over the open file object line by line
    f = open("/domain_urls.txt")
    start_urls = f

    def parse(self, response):
        item = DomainItem()
        ...  # populate the item's fields from the response (abridged)
        return item
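
For reference, a variant that reads the same file but logs every download failure explicitly can help show which input URLs never produce a response (the stats further down report 30 DNS lookup failures and 6 TCP timeouts). This is only a rough sketch against the same Scrapy API as above; the spider name and the on_error helper are placeholders I made up:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class DomainArticlesErrorLoggingSpider(BaseSpider):
    name = "domain_articles_errors"   # placeholder name for this sketch
    allowed_domains = ["domain.com"]

    def start_requests(self):
        # Yield one request per non-empty line of the URL file
        with open("/domain_urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield Request(url, callback=self.parse, errback=self.on_error)

    def on_error(self, failure):
        # Called when the download fails (DNS error, timeout, ...);
        # such URLs never reach parse()
        self.log("download failed: %s" % failure.getErrorMessage())

    def parse(self, response):
        # same item-building logic as in the original spider
        pass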

With that spider, the run ends like this:

2014-10-08 21:18:31-0400 [domain_articles] ERROR: Error downloading <GET url>: TCP connection timed out: 110: Connection timed out.

2014-10-08 21:18:31-0400 [domain_articles] INFO: Closing spider (finished)
2014-10-08 21:18:31-0400 [domain_articles] INFO: Stored csv feed (6848 items) in: culture_metadata2.csv
2014-10-08 21:18:31-0400 [domain_articles] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 36,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 30,
     'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 6,
     'downloader/request_bytes': 3301700,
     'downloader/request_count': 7658,
     'downloader/request_method_count/GET': 7658,
     'downloader/response_bytes': 200927645,
     'downloader/response_count': 7622,
     'downloader/response_status_count/200': 6848,
     'downloader/response_status_count/301': 637,
     'downloader/response_status_count/302': 8,
     'downloader/response_status_count/404': 129,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 10, 9, 1, 18, 31, 575068),
     'item_scraped_count': 6848,
     'log_count/DEBUG': 14637,
     'log_count/ERROR': 13,
     'log_count/INFO': 42,
     'response_received_count': 6977,
     'scheduler/dequeued': 7658,
     'scheduler/dequeued/disk': 7658,
     'scheduler/enqueued': 7658,
     'scheduler/enqueued/disk': 7658,
     'start_time': datetime.datetime(2014, 10, 9, 0, 44, 5, 340444)}
2014-10-08 21:18:31-0400 [domain_articles] INFO: Spider closed (finished)

Any idea what is causing the stop? I do not understand why it finishes; there are still URLs left in the urls.txt file. If I restart the spider from the URL it had reached, it starts working again.
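
Since the stats report finish_reason: 'finished' and scheduler/enqueued equals scheduler/dequeued (7658), one thing worth checking is how many URLs the spider is actually fed from the file. A quick sanity check along these lines (a sketch, assuming one URL per non-empty line in the same file the spider reads):

# Count the non-empty lines the spider would receive as start URLs
with open("/domain_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

print("URLs in file:", len(urls))
# Compare with 'scheduler/enqueued' and 'item_scraped_count' in the stats above;
# redirects and retries add extra requests, so the file count is normally
# somewhat lower than 'downloader/request_count'.
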

0 Answers:

There are no answers yet.