Scrapy: passing arguments and writing to MySQL

Posted: 2015-10-19 00:55:38

Tags: python mysql scrapy

I'm working on a data-scraping project and I'm new to Scrapy. It seems very powerful, but also tricky (at first, at least).

My MySQL database contains 2 tables: "thelist" and "data".

The thelist table is a list of entities - businesses, blogs, venues, etc. - that I have already scraped from directory sites (using mechanize, BeautifulSoup, and regex). The row ID from thelist appears as "thelist_id" in the data table, where it is a foreign key back to the thelist table.

Now I want a spider to visit each entity's own website and scrape email addresses. I plan to use a Python script to select an entity from "thelist" and launch Scrapy via os.system, passing command-line arguments, for example:

$ scrapy crawl furious -a domain=930.com -a start_url='http://www.930.com/' -a thelist_id=137522
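
For example, a minimal sketch of such a driver script (the thelist column names, the engine URL, and the use of SQLAlchemy's pre-2.0 engine.execute() here are assumptions for illustration, not code from the project):

# run_one.py -- rough sketch of the driver script described above;
# the thelist columns (id, domain, url) are assumed, not taken from the real schema
import os
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8')

# pick one entity to crawl (the selection criteria are up to you)
row = engine.execute("SELECT id, domain, url FROM thelist LIMIT 1").fetchone()

# hand the values to the spider as -a command-line arguments
os.system("scrapy crawl furious -a domain=%s -a start_url='%s' -a thelist_id=%s"
          % (row.domain, row.url, row.id))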

Once the crawl finishes, Scrapy should write the retrieved emails back to the database, into the data table, and it needs the thelist_id value from the command-line arguments to fill that column, so the new rows are tied back to the entity in the thelist table (the master entity list).
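
For reference, here is roughly how the two tables would look as SQLAlchemy Table objects. The real definitions live in the music_tables module imported by the pipeline below, which is not shown in this post, so the columns here are inferred from how the pipeline uses them rather than copied from it:

# assumed layout, inferred from data.insert().values(thelist=..., tag=22, value=...)
from sqlalchemy import MetaData, Table, Column, Integer, String, ForeignKey

metadata = MetaData()

thelist = Table('thelist', metadata,
    Column('id', Integer, primary_key=True),
    # plus the entity columns scraped earlier (name, domain, url, ...)
)

data = Table('data', metadata,
    Column('id', Integer, primary_key=True),
    Column('thelist', Integer, ForeignKey('thelist.id')),  # row ID from thelist
    Column('tag', Integer),        # e.g. tag 22 is used for emails
    Column('value', String(255)),  # the scraped value itself
)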

Here are the various scripts:

items.py

import scrapy

class FuriousmeItem(scrapy.Item):
    emails = scrapy.Field()
    thelist_id = scrapy.Field()

settings.py

BOT_NAME = 'furiousme'

SPIDER_MODULES = ['furiousme.spiders']
NEWSPIDER_MODULE = 'furiousme.spiders'

ITEM_PIPELINES = [
    'furiousme.pipelines.FuriousmePipeline',
]

furious.py (the spider)

import scrapy
from furiousme.items import FuriousmeItem


class FuriousSpider(scrapy.Spider):
    name = "furious"

    def __init__(self, domain, start_url, thelist_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.thelist_id = thelist_id


    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = FuriousmeItem()
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]/text()").extract()
            item['entity_id'] = self.thelist_id
            yield item 

pipelines.py

import logging
import sys

# DATABASE
import pymysql
import sqlalchemy
from sqlalchemy.sql import table, column, exists
from sqlalchemy import *

sys.path.append("/Volumes/Orange-1a/^datamine/^scripts/^foundation/")
import music_tables
from music_tables import *

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

logger = logging.getLogger(__name__)


class FuriousmePipeline(object):

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        logger.info(item)

        some_engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8&use_unicode=0', pool_recycle=3600)

        # create a configured "Session" class
        Session = sessionmaker(bind=some_engine)

        # create a Session
        session = Session()

        thelist_id = item.get('entity_id')

        for email in item.get('emails'):
            if not email in self.seen:
                self.seen.append(email)
                try:
                    ins = data.insert().values(thelist=thelist_id, tag=22, value=email)
                except Exception as e:
                    print('INSERT ERROR:', thelist_id)

        return item

Question:

How do I pass the command-line argument (e.g. "thelist_id", the row ID in the database) through to FuriousmePipeline, so it can be used as the value of the foreign-key column when the scraped data is written back to the database, tying it back to the original entity in the thelist table?

2 Answers:

Answer 0 (score: 0):

Your frustration is completely understandable. This example of Scrapy with MySQL on GitHub helped me a lot. It contains all the code needed to write to a MySQL database.

Answer 1 (score: 0):

Many thanks to @LearnAWK and @Rejected for helping me work this out.

To store the argument, set up an item for it in items.py:

import scrapy

class FuriousmeItem(scrapy.Item):
    emails = scrapy.Field()
    entity_id = scrapy.Field()  # this will hold the argument

Set up the spider to receive the arguments in `def __init__` as below, then actually store the value on the item in `def parse_dir_contents`, also shown below.

furious.py (the spider)

import scrapy
from furiousme.items import FuriousmeItem


class FuriousSpider(scrapy.Spider):
    name = "furious"

    def __init__(self, domain, start_url, thelist_id, *args, **kwargs):
        super(FuriousSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.thelist_id = thelist_id  # this receives the argument


    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = FuriousmeItem()
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]/text()").extract()
            item['entity_id'] = self.thelist_id # this stores the argument
            yield item 

settings.py needs to enable the pipeline:

settings.py

BOT_NAME = 'furiousme'

SPIDER_MODULES = ['furiousme.spiders']
NEWSPIDER_MODULE = 'furiousme.spiders'

ITEM_PIPELINES = [
    'furiousme.pipelines.FuriousmePipeline',
]
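
(Note: the list form above was accepted by older Scrapy releases; current Scrapy expects ITEM_PIPELINES to be a dict that maps each pipeline path to an order value, for example:)

ITEM_PIPELINES = {
    'furiousme.pipelines.FuriousmePipeline': 300,
}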

pipelines.py

import logging
import sys

# DATABASE
import pymysql
import sqlalchemy
from sqlalchemy.sql import table, column, exists
from sqlalchemy import *

sys.path.append("/Volumes/Orange-1a/^datamine/^scripts/^foundation/")
import music_tables
from music_tables import *

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

logger = logging.getLogger(__name__)


class FuriousmePipeline(object):

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        logger.info(item)

        # creating the engine on every item works, but it could be moved to
        # the pipeline's __init__ or open_spider() so it is only built once
        some_engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8&use_unicode=0', pool_recycle=3600)

        # create a configured "Session" class
        Session = sessionmaker(bind=some_engine)

        # create a Session
        session = Session()

        # open a connection for executing the insert statements below
        conn = some_engine.connect()

        thelist_id = item.get('entity_id')

        for email in item.get('emails'):
            if not email in self.seen:
                self.seen.append(email)
                try:
                    ins = data.insert().values(thelist=thelist_id, tag=22, value=email)
                    conn.execute(ins)  # remember this to write to database!
                except Exception as e:
                    print('INSERT ERROR:', thelist_id)

        return item

Result: this writes de-duplicated data to the database!
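
As a quick sanity check (a sketch reusing the pipeline's engine URL and the pre-2.0 SQLAlchemy engine.execute() API; 137522 is the thelist_id passed on the command line above), you can confirm the rows landed in the data table:

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8')
for row in engine.execute("SELECT value FROM data WHERE thelist = 137522 AND tag = 22"):
    print(row.value)  # the emails scraped for entity 137522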