exited: scrapy (exit status 0; not expected)

Time: 2015-08-21 11:50:43

Tags: python docker scrapy supervisord supervisor

I am trying to run a bash script that starts several spiders inside my Docker container. My supervisor.conf, placed in /etc/supervisor/conf.d/, looks like this:

[program:scrapy]                                                            
command=/tmp/start_spider.sh
autorestart=false
startretries=0
stderr_logfile=/tmp/start_spider.err.log
stdout_logfile=/tmp/start_spider.out.log
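
For context, a minimal start_spider.sh of the kind described here might look like the sketch below; the actual script is not shown in the question, and the project path and spider names are placeholders:

#!/bin/bash
# Hypothetical /tmp/start_spider.sh: run several spiders one after another.
cd /app/crawler || exit 1
scrapy crawl spider_one
scrapy crawl spider_two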

But supervisor returns these errors:

  

2015-08-21 10:50:30,466 CRIT Supervisor running as root (no user in config file)

2015-08-21 10:50:30,466 WARN Included extra file "/etc/supervisor/conf.d/tor.conf" during parsing

2015-08-21 10:50:30,478 INFO RPC interface 'supervisor' initialized

2015-08-21 10:50:30,478 CRIT Server 'unix_http_server' running without any HTTP authentication checking

2015-08-21 10:50:30,478 INFO supervisord started with pid 5

2015-08-21 10:50:31,481 INFO spawned: 'scrapy' with pid 8

2015-08-21 10:50:31,555 INFO exited: scrapy (exit status 0; not expected)

2015-08-21 10:50:32,557 INFO gave up: scrapy entered FATAL state, too many start retries too quickly

My program stops running. But if I run it manually, it works perfectly well...

How can I solve this problem? Any ideas?

2 Answers:

Answer 0 (score: 2):

I found the solution to my problem. For the supervisor.conf, change it to:

[program:scrapy]
command=/tmp/start_spider.sh
autorestart=false
startretries=0
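
A possible reading of the original failure, not something stated in this answer: start_spider.sh appears to exit almost immediately after launching the spiders, so supervisord sees the process die before the default startsecs of 1 second has elapsed, counts the start as failed, and with startretries=0 it gives up and enters FATAL right away. If the script really is meant to start the spiders and then exit, telling supervisor that a quick, clean exit is acceptable may also help; a sketch of such a configuration, assuming the same paths as above:

[program:scrapy]
command=/tmp/start_spider.sh
autorestart=false
startretries=0
startsecs=0
stderr_logfile=/tmp/start_spider.err.log
stdout_logfile=/tmp/start_spider.out.log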

Answer 1 (score: 0):

Here is my scrapy code:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-

from scrapy.selector import Selector
from elasticsearch import Elasticsearch
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from urlparse import urljoin
from bs4 import BeautifulSoup
from scrapy.spider import BaseSpider
from tools import sendEmail
from tools import ElasticAction
from tools import runlog
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from datetime import datetime
import re

class studentCrawler(BaseSpider):
    # Time at which the crawl started
    started_on = datetime.now()

    name = "root"


    DOWNLOAD_DELAY = 0

    allowed_domains = ['website.com']

    ES_Index = "website"
    ES_Type = "root"
    ES_Ip = "127.0.0.1"

    child_type = "level1"

    handle_httpstatus_list = [404, 302, 503, 999, 200] #add any other code you need

    es = ElasticAction(ES_Index, ES_Type, ES_Ip)

    # Init
    def __init__(self, alpha=''):
        # Build the start URL from the 'alpha' argument passed to the spider
        base_domain = 'https://www.website.com/directory/student-' + str(alpha) + "/"

        self.start_urls = [base_domain]
        # BaseSpider.__init__ takes the spider name as its first argument, so it is called without arguments here
        super(studentCrawler, self).__init__()


    def is_empty(self, any_structure):
        """
        Function that allow to check if the data is empty or not
        :arg any_structure: any data
        """
        if any_structure:
            return 1
        else:
            return 0

    def parse(self, response):
        """
        main method that parse the web page
        :param response:
        :return:
        """

        if response.status == 404:
            self.es.insertIntoES(response.url, "False")
        if str(response.status) == "503":
            self.es.insertIntoES(response.url, "False")
        if response.status == 999:
            self.es.insertIntoES(response.url, "False")

        if str(response.status) == "200":
            # Selector
            sel = Selector(response)

            self.es.insertIntoES(response.url, "True")
            # Join the extracted HTML of the directory block and collect every link it contains
            body = self.getAllTheUrl(u''.join(sel.xpath(".//*[@id='seo-dir']/div/div[3]").extract()).strip(), response.url)


    def getAllTheUrl(self, data, parent_id):
        # Parse the HTML fragment and index every link it contains under its parent URL
        soup = BeautifulSoup(data, 'html.parser')
        for a in soup.find_all('a', href=True):
            self.es.insertChildAndParent(self.child_type, str(a['href']), "False", parent_id)
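
For reference, a spider defined like this (name = "root", taking an alpha constructor argument) would normally be launched from the scrapy project directory with the command line below, which is presumably what start_spider.sh wraps; the value passed to alpha is only an example:

scrapy crawl root -a alpha=a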


I found that, when the spiders are launched by supervisor, BeautifulSoup does not work...
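
One possible lead, since the behaviour differs between a manual run and a run under supervisor: supervisord launches programs with a minimal environment and its own working directory, so differences in PATH, PYTHONPATH, locale or the current directory are a common cause of "works by hand, fails under supervisor". Supervisor's directory= and environment= options can be set in the program section to mirror the interactive shell; the paths below are placeholders:

[program:scrapy]
command=/tmp/start_spider.sh
directory=/app/crawler
environment=PYTHONPATH="/app/crawler",PATH="/usr/local/bin:/usr/bin:/bin"
autorestart=false
startretries=0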