Scrapy: passing arguments and writing to MySQL

Posted: 2015-10-19 00:55:38

Tags: python mysql scrapy

I'm working on a data-scraping project and I'm new to Scrapy. It seems very powerful, but also tricky (at first, at least).

My MySQL database contains 2 tables: "thelist" and "data".

The thelist table is a list of entities - businesses, blogs, venues, etc. - that I have already scraped from directory sites (using mechanize, BeautifulSoup, and regex). The row ID from thelist appears as "thelist_id" in the data table, where it is a foreign key back to the thelist table.

Now I want a spider to visit each entity's own website and scrape email addresses. I plan to use a Python script to select an entity from "thelist" and launch Scrapy via os.system, passing command-line arguments, for example:

$ scrapy crawl furious -a domain=930.com -a start_url='http://www.930.com/' -a thelist_id=137522
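
For example, a minimal sketch of such a driver script (the thelist column names, the engine URL, and the use of SQLAlchemy's pre-2.0 engine.execute() here are assumptions for illustration, not code from the project):

# run_one.py -- rough sketch of the driver script described above;
# the thelist columns (id, domain, url) are assumed, not taken from the real schema
import os
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8')

# pick one entity to crawl (the selection criteria are up to you)
row = engine.execute("SELECT id, domain, url FROM thelist LIMIT 1").fetchone()

# hand the values to the spider as -a command-line arguments
os.system("scrapy crawl furious -a domain=%s -a start_url='%s' -a thelist_id=%s"
          % (row.domain, row.url, row.id))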

Once the crawl finishes, Scrapy should write the retrieved emails back to the database, into the data table, and it needs the thelist_id value from the command-line arguments to fill that column, so the new rows are tied back to the entity in the thelist table (the master entity list).
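
For reference, here is roughly how the two tables would look as SQLAlchemy Table objects. The real definitions live in the music_tables module imported by the pipeline below, which is not shown in this post, so the columns here are inferred from how the pipeline uses them rather than copied from it:

# assumed layout, inferred from data.insert().values(thelist=..., tag=22, value=...)
from sqlalchemy import MetaData, Table, Column, Integer, String, ForeignKey

metadata = MetaData()

thelist = Table('thelist', metadata,
    Column('id', Integer, primary_key=True),
    # plus the entity columns scraped earlier (name, domain, url, ...)
)

data = Table('data', metadata,
    Column('id', Integer, primary_key=True),
    Column('thelist', Integer, ForeignKey('thelist.id')),  # row ID from thelist
    Column('tag', Integer),        # e.g. tag 22 is used for emails
    Column('value', String(255)),  # the scraped value itself
)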

Here are the various scripts:

items.py

import scrapy

class FuriousmeItem(scrapy.Item):
    emails = scrapy.Field()
    thelist_id = scrapy.Field()

settings.py

BOT_NAME = 'furiousme'

SPIDER_MODULES = ['furiousme.spiders']
NEWSPIDER_MODULE = 'furiousme.spiders'

ITEM_PIPELINES = [
    'furiousme.pipelines.FuriousmePipeline',
]

furious.py (the spider)

import scrapy
from furiousme.items import FuriousmeItem


class FuriousSpider(scrapy.Spider):
    name = "furious"

    def __init__(self, domain, start_url, thelist_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.thelist_id = thelist_id


    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = FuriousmeItem()
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]/text()").extract()
            item['entity_id'] = self.thelist_id
            yield item 

pipelines.py

import logging
import sys

# DATABASE
import pymysql
import sqlalchemy
from sqlalchemy.sql import table, column, exists
from sqlalchemy import *

sys.path.append("/Volumes/Orange-1a/^datamine/^scripts/^foundation/")
import music_tables
from music_tables import *

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

logger = logging.getLogger(__name__)


class FuriousmePipeline(object):

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        logger.info(item)

        some_engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8&use_unicode=0', pool_recycle=3600)

        # create a configured "Session" class
        Session = sessionmaker(bind=some_engine)

        # create a Session
        session = Session()

        thelist_id = item.get('entity_id')

        for email in item.get('emails'):
            if not email in self.seen:
                self.seen.append(email)
                try:
                    ins = data.insert().values(thelist=thelist_id, tag=22, value=email)
                except Exception as e:
                    print('INSERT ERROR:', thelist_id)

        return item

Question:

How do I pass the command-line argument (e.g. "thelist_id", the row ID in the database) through to FuriousmePipeline, so it can be used as the value of the foreign-key column when the scraped data is written back to the database, tying it back to the original entity in the thelist table?

2 Answers:

Answer 0 (score: 0):

Your frustration is completely understandable. This example of Scrapy with MySQL on GitHub helped me a lot. It contains all the code needed to write to a MySQL database.

Answer 1 (score: 0):

Many thanks to @LearnAWK and @Rejected for helping me work this out.

To store the argument, set up an item for it in items.py:

import scrapy

class FuriousmeItem(scrapy.Item):
    emails = scrapy.Field()
    entity_id = scrapy.Field()  # this will hold the argument

Set up the spider to receive the arguments in `def __init__` as below, then actually store the value on the item in `def parse_dir_contents`, also shown below.

furious.py (the spider)

import scrapy
from furiousme.items import FuriousmeItem


class FuriousSpider(scrapy.Spider):
    name = "furious"

    def __init__(self, domain, start_url, thelist_id, *args, **kwargs):
        super(FuriousSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.thelist_id = thelist_id  # this receives the argument


    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = FuriousmeItem()
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]/text()").extract()
            item['entity_id'] = self.thelist_id # this stores the argument
            yield item 

settings.py needs to enable the pipeline:

settings.py

BOT_NAME = 'furiousme'

SPIDER_MODULES = ['furiousme.spiders']
NEWSPIDER_MODULE = 'furiousme.spiders'

ITEM_PIPELINES = [
    'furiousme.pipelines.FuriousmePipeline',
]
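
(Note: the list form above was accepted by older Scrapy releases; current Scrapy expects ITEM_PIPELINES to be a dict that maps each pipeline path to an order value, for example:)

ITEM_PIPELINES = {
    'furiousme.pipelines.FuriousmePipeline': 300,
}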

pipelines.py

import logging
import sys

# DATABASE
import pymysql
import sqlalchemy
from sqlalchemy.sql import table, column, exists
from sqlalchemy import *

sys.path.append("/Volumes/Orange-1a/^datamine/^scripts/^foundation/")
import music_tables
from music_tables import *

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

logger = logging.getLogger(__name__)


class FuriousmePipeline(object):

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        logger.info(item)

        # creating the engine on every item works, but it could be moved to
        # the pipeline's __init__ or open_spider() so it is only built once
        some_engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8&use_unicode=0', pool_recycle=3600)

        # create a configured "Session" class
        Session = sessionmaker(bind=some_engine)

        # create a Session
        session = Session()

        # open a connection for executing the insert statements below
        conn = some_engine.connect()

        thelist_id = item.get('entity_id')

        for email in item.get('emails'):
            if not email in self.seen:
                self.seen.append(email)
                try:
                    ins = data.insert().values(thelist=thelist_id, tag=22, value=email)
                    conn.execute(ins)  # remember this to write to database!
                except Exception as e:
                    print('INSERT ERROR:', thelist_id)

        return item

Result: this writes de-duplicated data to the database!
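
As a quick sanity check (a sketch reusing the pipeline's engine URL and the pre-2.0 SQLAlchemy engine.execute() API; 137522 is the thelist_id passed on the command line above), you can confirm the rows landed in the data table:

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8')
for row in engine.execute("SELECT value FROM data WHERE thelist = 137522 AND tag = 22"):
    print(row.value)  # the emails scraped for entity 137522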