I am working on a data-scraping project and am new to Scrapy. It seems powerful, but also tricky (at first, at least).
My MySQL database has two tables: "thelist" and "data".
The thelist table is a list of entities - businesses, blogs, venues, etc. - that I have already scraped from directory sites (using mechanize, BeautifulSoup, and regex). A row's ID from thelist is stored as "thelist_id" in the data table, where it serves as a foreign key back to the thelist table.
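(For reference, a rough sketch of how those two tables might be declared with SQLAlchemy; the column names for "data" are inferred from the pipeline insert further down, and everything else here is an assumption rather than the real schema:)

# Hypothetical sketch of the schema (not the real music_tables module):
from sqlalchemy import MetaData, Table, Column, Integer, String, ForeignKey

metadata = MetaData()

thelist = Table('thelist', metadata,
                Column('id', Integer, primary_key=True))   # one row per entity

data = Table('data', metadata,
             Column('id', Integer, primary_key=True),
             Column('thelist', Integer, ForeignKey('thelist.id')),  # FK back to thelist
             Column('tag', Integer),                                # e.g. 22 appears to mark an email value
             Column('value', String(255)))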
Now I want a spider to visit each entity's own website and scrape email addresses. My plan is a Python script that selects one entity from "thelist" and then runs Scrapy via os.system, passing command-line arguments:
$ scrapy crawl furious -a domain=930.com -a start_url='http://www.930.com/' -a thelist_id=137522
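(A minimal sketch of what that selection script could look like; the thelist column names id, domain, and url are assumptions, since the real schema isn't shown:)

import os
import pymysql

# Pick one entity from thelist (column names here are assumed).
conn = pymysql.connect(host='127.0.0.1', user='root', db='music_marketing', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT id, domain, url FROM thelist LIMIT 1")
        thelist_id, domain, url = cursor.fetchone()
finally:
    conn.close()

# Hand the selected row to the spider as command-line arguments.
os.system("scrapy crawl furious -a domain=%s -a start_url='%s' -a thelist_id=%s"
          % (domain, url, thelist_id))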
When the crawl finishes, Scrapy should write the retrieved emails back to the database, into the data table, and it needs the thelist_id value from the command-line arguments to fill that column, so the data is tied back to the thelist table (the master entity list).
Here are the various scripts:
items.py
import scrapy
class FuriousmeItem(scrapy.Item):
    emails = scrapy.Field()
    thelist_id = scrapy.Field()
settings.py
BOT_NAME = 'furiousme'
SPIDER_MODULES = ['furiousme.spiders']
NEWSPIDER_MODULE = 'furiousme.spiders'
ITEM_PIPELINES = [
    'furiousme.pipelines.FuriousmePipeline',
]
furious.py (the spider)
import scrapy
from furiousme.items import FuriousmeItem
class FuriousSpider(scrapy.Spider):
    name = "furious"

    def __init__(self, domain, start_url, thelist_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.thelist_id = thelist_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = FuriousmeItem()
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]/text()").extract()
            item['entity_id'] = self.thelist_id
            yield item
pipelines.py
import logging
import sys
# DATABASE
import pymysql
import sqlalchemy
from sqlalchemy.sql import table, column, exists
from sqlalchemy import *
sys.path.append("/Volumes/Orange-1a/^datamine/^scripts/^foundation/")
import music_tables
from music_tables import *
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
logger = logging.getLogger(__name__)
class FuriousmePipeline(object):
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        logger.info(item)
        some_engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8&use_unicode=0', pool_recycle=3600)

        # create a configured "Session" class
        Session = sessionmaker(bind=some_engine)

        # create a Session
        session = Session()

        thelist_id = item.get('entity_id')
        for email in item.get('emails'):
            if not email in self.seen:
                self.seen.append(email)
                try:
                    ins = data.insert().values(thelist=thelist_id, tag=22, value=email)
                except Exception, e:
                    print 'INSERT ERROR: ', thelist_id
        return item
Question:
How do I pass a command-line argument such as "thelist_id" (the row ID in the database) through to FuriousmePipeline, so it can be used as the value of the foreign-key column when the scraped data is written back to the database, tying it to the original entity?
Answer 0 (score: 0)
Your frustration is completely understandable. This example of Scrapy and MySQL on GitHub helped me a great deal. It contains all the code needed to write to a MySQL database.
Answer 1 (score: 0)
Many thanks to @LearnAWK and @Rejected for helping me work through this.
To store the argument, define an item field to hold it in items.py:

import scrapy

class FuriousmeItem(scrapy.Item):
    emails = scrapy.Field()
    entity_id = scrapy.Field()  # this will hold the argument
Set up the spider to receive the arguments in def __init__, as below. Then actually store them on the item in def parse_dir_contents, also shown below.
furious.py (the spider)
import scrapy
from furiousme.items import FuriousmeItem
class FuriousSpider(scrapy.Spider):
    name = "furious"

    def __init__(self, domain, start_url, thelist_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.thelist_id = thelist_id  # this receives the argument

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = FuriousmeItem()
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]/text()").extract()
            item['entity_id'] = self.thelist_id  # this stores the argument
            yield item
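(Side note, not part of the original answer: the same spider arguments can also be passed programmatically from a Python script, which avoids os.system entirely; a minimal sketch using Scrapy's CrawlerProcess:)

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from furiousme.spiders.furious import FuriousSpider

process = CrawlerProcess(get_project_settings())
# Keyword arguments are forwarded to FuriousSpider.__init__.
process.crawl(FuriousSpider,
              domain='930.com',
              start_url='http://www.930.com/',
              thelist_id=137522)
process.start()  # blocks until the crawl finishes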
settings.py needs to enable the pipeline:
settings.py
BOT_NAME = 'furiousme'
SPIDER_MODULES = ['furiousme.spiders']
NEWSPIDER_MODULE = 'furiousme.spiders'
ITEM_PIPELINES = [
    'furiousme.pipelines.FuriousmePipeline',
]
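(Note: newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping the pipeline path to an order number, so on a recent version the equivalent setting would look like this:)

ITEM_PIPELINES = {
    'furiousme.pipelines.FuriousmePipeline': 300,
}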
pipelines.py
import logging
import sys
# DATABASE
import pymysql
import sqlalchemy
from sqlalchemy.sql import table, column, exists
from sqlalchemy import *
sys.path.append("/Volumes/Orange-1a/^datamine/^scripts/^foundation/")
import music_tables
from music_tables import *
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
logger = logging.getLogger(__name__)
class FuriousmePipeline(object):
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        logger.info(item)
        some_engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8&use_unicode=0', pool_recycle=3600)

        # create a configured "Session" class
        Session = sessionmaker(bind=some_engine)

        # create a Session
        session = Session()

        thelist_id = item.get('entity_id')
        for email in item.get('emails'):
            if email not in self.seen:
                self.seen.append(email)
                try:
                    ins = data.insert().values(thelist=thelist_id, tag=22, value=email)
                    session.execute(ins)  # remember this to actually write to the database!
                    session.commit()
                except Exception as e:
                    logger.error('INSERT ERROR: %s (%s)', thelist_id, e)
        return item
Result: this writes the de-duplicated data to the database!
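(A possible refinement, my own suggestion rather than part of the original answer: the engine and session can be created once per crawl via the pipeline's open_spider/close_spider hooks instead of once per item, roughly like this:)

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from music_tables import data  # same table object as above

class FuriousmePipeline(object):
    def __init__(self):
        self.seen = []
        self.session = None

    def open_spider(self, spider):
        # One engine/session for the whole crawl.
        engine = create_engine('mysql+pymysql://root@127.0.0.1/music_marketing?charset=utf8&use_unicode=0',
                               pool_recycle=3600)
        self.session = sessionmaker(bind=engine)()

    def close_spider(self, spider):
        self.session.close()

    def process_item(self, item, spider):
        thelist_id = item.get('entity_id')
        for email in item.get('emails'):
            if email not in self.seen:
                self.seen.append(email)
                self.session.execute(data.insert().values(thelist=thelist_id, tag=22, value=email))
        self.session.commit()
        return item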