Sending an email with attachments after crawling a website

Time: 2013-12-12 22:29:55

Tags: python email web-crawler scrapy

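I'm using Scrapy in a school project to find dead links and missing pages. I have written pipelines that write text files containing the relevant scraped information. What I can't figure out is how to send an email at the end of the spider run with the generated files as attachments.

Scrapy has built-in email functionality and fires a signal when the spider finishes, but putting it all together in a sensible way is eluding me. Any help would be greatly appreciated.

Here is my pipeline that creates the files from the scraped data: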

class saveToFile(object):

    def __init__(self):
        # open files
        self.old = open('old_pages.txt', 'wb')
        self.date = open('pages_without_dates.txt', 'wb')
        self.missing = open('missing_pages.txt', 'wb')

        # write table headers
        line = "{0:15} {1:40} {2:} \n\n".format("Domain","Last Updated","URL")
        self.old.write(line)

        line = "{0:15} {1:} \n\n".format("Domain","URL")
        self.date.write(line)

        line = "{0:15} {1:70} {2:} \n\n".format("Domain","Page Containing Broken Link","URL of Broken Link")
        self.missing.write(line)

    def process_item(self, item, spider):

        # add items to file as they are scraped
        if item['group'] == "Old Page":
            line = "{0:15} {1:40} {2:} \n".format(item['domain'],item["lastUpdated"],item["url"])
            self.old.write(line)
        elif item['group'] == "No Date On Page":
            line = "{0:15} {1:} \n".format(item['domain'],item["url"])
            self.date.write(line)
        elif item['group'] == "Page Not Found":
            line = "{0:15} {1:70} {2:} \n".format(item['domain'],item["referrer"],item["url"])
            self.missing.write(line)

        return item

I would like to create a separate pipeline item to send the email. What I have so far is:

class emailResults(object):

    def __init__(self):

        dispatcher.connect(self.spider_closed, spider_closed)
        dispatcher.connect(self.spider_opened, spider_opened)

        old = open('old_pages.txt', 'wb')
        date = open('pages_without_dates.txt', 'wb')
        missing = open('missing_pages.txt', 'wb')
        oldOutput = open('twenty_oldest_pages.txt', 'wb')

        attachments = [
            ("old_pages", "text/plain", old),
            ("date", "text/plain", date),
            ("missing", "text/plain", missing),
            ("oldOutput", "text/plain", oldOutput)
        ]

        self.mailer = MailSender()

    def spider_closed(SPIDER_NAME):

        self.mailer.send(to=["example@gmail.com"], attachs=attachments, subject="test email", body="Some body")

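It seems that in earlier versions of Scrapy you could pass self into the spider_closed function, but in the current version (0.21) the spider_closed function is only passed the spider name.

Any help and/or suggestions would be greatly appreciated.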

1 answer:

Answer 0: (score: 3)

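Creating the mail-sending class as a pipeline is not the best idea. It is better to create it as your own extension. You can read more about extensions here: http://doc.scrapy.org/en/latest/topics/extensions.html

The most important part is the class method from_crawler. It is called for every crawler, and inside it you can register callbacks for the signals you want to intercept. For example, this function in my mailer class looks like this: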

@classmethod
def from_crawler(cls, crawler):
    recipients = crawler.settings.getlist('STATUSMAILER_RECIPIENTS')
    if not recipients:
        raise NotConfigured

    mail = MailSender.from_settings(crawler.settings)
    instance = cls(recipients, mail, crawler)

    crawler.signals.connect(instance.item_scraped, signal=signals.item_scraped)
    crawler.signals.connect(instance.spider_error, signal=signals.spider_error)
    crawler.signals.connect(instance.spider_closed, signal=signals.spider_closed)
    crawler.signals.connect(instance.item_dropped, signal=signals.item_dropped)

    return instance
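
To tie this back to the question, here is a minimal sketch of how the rest of such an extension might look when it only needs to mail the report files once the spider closes. The class name StatusMailer, the report_files parameter, and the subject and body text are assumptions made for illustration and are not part of my actual mailer class. The file names come from the question's pipeline. The handler receives the spider object and the close reason from the spider_closed signal, and the attachs argument of MailSender.send expects tuples of (attach_name, mimetype, file_object).

```python
import os


class StatusMailer(object):
    """Hypothetical extension: mails the report files when the spider closes."""

    # File names taken from the question's pipeline.
    REPORT_FILES = ['old_pages.txt', 'pages_without_dates.txt', 'missing_pages.txt']

    def __init__(self, recipients, mail, report_files=None):
        self.recipients = recipients
        self.mail = mail  # any object with a MailSender-style send() method
        self.report_files = list(report_files or self.REPORT_FILES)

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so the class can be exercised without Scrapy installed.
        from scrapy import signals
        from scrapy.exceptions import NotConfigured
        from scrapy.mail import MailSender

        recipients = crawler.settings.getlist('STATUSMAILER_RECIPIENTS')
        if not recipients:
            raise NotConfigured

        instance = cls(recipients, MailSender.from_settings(crawler.settings))
        crawler.signals.connect(instance.spider_closed, signal=signals.spider_closed)
        return instance

    def spider_closed(self, spider, reason):
        # Open for reading ('rb'); opening with 'wb' would truncate the reports.
        handles = [open(path, 'rb') for path in self.report_files
                   if os.path.exists(path)]
        attachs = [(os.path.basename(f.name), 'text/plain', f) for f in handles]
        try:
            return self.mail.send(
                to=self.recipients,
                subject='Crawl report for %s (%s)' % (spider.name, reason),
                body='The report files are attached.',
                attachs=attachs,
            )
        finally:
            # Assumption: send() reads the file contents before it returns.
            for f in handles:
                f.close()
```

Passing the mailer into the constructor keeps the handler easy to try out with a stand-in object. Note that the question's attempt opened the report files with 'wb', which empties them before they are attached.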

For convenient use, remember to put all the necessary data in your settings:

EXTENSIONS = {
    'your.mailer': 80
}

STATUSMAILER_RECIPIENTS = ["who should get mail"]

MAIL_HOST = '***'
MAIL_PORT = ***
MAIL_USER = '***'
MAIL_PASS = '***'
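
One more detail matters when the mail is supposed to carry files written by a pipeline: the files should be flushed and closed before anything attaches them. The sketch below is a reduced, hypothetical version of the question's pipeline (one report only, with a path parameter added for illustration) that closes its file in close_spider, a hook Scrapy calls on each pipeline when the spider is closed. I believe pipelines are closed before the spider_closed signal reaches extensions, but that ordering is worth verifying for your Scrapy version.

```python
class SaveToFile(object):
    """Reduced, hypothetical version of the question's pipeline (one report)."""

    def __init__(self, path='missing_pages.txt'):
        # Text mode works for str data on both Python 2 and Python 3.
        self.missing = open(path, 'w')
        self.missing.write("{0:15} {1:70} {2:} \n\n".format(
            "Domain", "Page Containing Broken Link", "URL of Broken Link"))

    def process_item(self, item, spider):
        # Append a row for every broken link as it is scraped.
        if item['group'] == "Page Not Found":
            self.missing.write("{0:15} {1:70} {2:} \n".format(
                item['domain'], item['referrer'], item['url']))
        return item

    def close_spider(self, spider):
        # Flush buffered rows to disk so the report is complete when mailed.
        self.missing.close()
```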