Scrapy: returning a list of ids after a crawl

Time: 2016-04-12 07:46:26

Tags: python scrapy

I wrote a custom spider that recursively walks through the pages of a website and stores the details of each crawl in my postgres database:

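import scrapy
import transaction

# Crawl and Session come from the project's own database layer (not shown)


class MySpider(scrapy.Spider):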
    name = 'my_spider'

    def __init__(self):
        self.start_urls = ['http://www.example.com']

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse_page)

    def parse_page(self, response):
        with transaction.manager:
            crawl = Crawl()
            crawl.url = response.request.url
            crawl.response_body = response.body
            Session.add(crawl)
            Session.flush()

        if len(response.css('.pager-next')) == 1:
            # build url for the next page to crawl
            # ...
            yield scrapy.Request(url=full_url, callback=self.parse_page)

The problem is that I want to get the list of ids of the crawls that were added to the database, so that another function can use it.

from scrapy.crawler import CrawlerProcess


def scrape_website():
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start() # <-- how to return crawl ids?

    parse_crawls(crawl_ids)

Any ideas?

1 answer:

Answer 0: (score: 1)

You should use an Item Pipeline to store the data in PostgreSQL.
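As a rough sketch of how a pipeline could also hand the ids back, the class below records the primary key of every row it inserts on the spider itself. Several things here are assumptions and not part of the original question: the spider yields dict-like items with `url` and `response_body` keys instead of writing to the database directly, `crawl_ids` is a made-up attribute name, and an in-memory `sqlite3` database stands in for PostgreSQL so the example runs anywhere (with psycopg2 you would typically use `INSERT ... RETURNING id` and `cursor.fetchone()` instead of `lastrowid`). `FakeSpider` and `run_demo` exist only to exercise the pipeline without Scrapy.

```python
import sqlite3


class CrawlIdCollectorPipeline:
    """Stores each item and remembers the id the database assigned to it."""

    def open_spider(self, spider):
        # Assumption: sqlite3 in memory replaces the real PostgreSQL connection.
        self.connection = sqlite3.connect(":memory:")
        self.connection.execute(
            "CREATE TABLE crawls (id INTEGER PRIMARY KEY, url TEXT, response_body BLOB)"
        )
        # Hypothetical attribute: the spider carries the ids out of the crawl.
        spider.crawl_ids = []

    def process_item(self, item, spider):
        cursor = self.connection.execute(
            "INSERT INTO crawls (url, response_body) VALUES (?, ?)",
            (item["url"], item["response_body"]),
        )
        self.connection.commit()
        # lastrowid is the primary key generated for the row just inserted.
        spider.crawl_ids.append(cursor.lastrowid)
        return item

    def close_spider(self, spider):
        self.connection.close()


class FakeSpider:
    """Stand-in for a scrapy.Spider instance, used only in this demo."""


def run_demo():
    spider = FakeSpider()
    pipeline = CrawlIdCollectorPipeline()
    pipeline.open_spider(spider)
    for url in ("http://www.example.com/?page=1", "http://www.example.com/?page=2"):
        pipeline.process_item({"url": url, "response_body": b"<html></html>"}, spider)
    pipeline.close_spider(spider)
    return spider.crawl_ids


if __name__ == "__main__":
    print(run_demo())
```

In a real run, one option (untested against your Scrapy version) would be to build the crawler with `crawler = process.create_crawler(MySpider)`, pass it to `process.crawl(crawler)`, and read `crawler.spider.crawl_ids` once `process.start()` returns, then hand that list to `parse_crawls`.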

Check out this article.

Example pipelines.py from the article:

import psycopg2
from scrapy_example_com.items import *


class ScrapyExampleComPipeline(object):
    def __init__(self):
        self.connection = psycopg2.connect(host='localhost', database='scrapy_example_com', user='postgres')
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        # check item type to decide which table to insert
        try:
            if type(item) is CustomerItem:
                self.cursor.execute(
                    """INSERT INTO customers (id, firstname, lastname, phone, created_at, updated_at, state) VALUES(%s, %s, %s, %s, %s, %s, %s)""",
                    (item.get('id'), item.get('firstname'), item.get('lastname'), item.get('phone'),
                     item.get('created_at'), item.get('updated_at'), item.get('state'), ))
            elif type(item) is CategoryItem:
                self.cursor.execute(
                    """INSERT INTO categories (id, name) VALUES(%s, %s)""",
                    (item.get('id'), item.get('code'), ))
            # no fetchall() here: an INSERT without RETURNING has no rows to fetch
            self.connection.commit()
        except psycopg2.DatabaseError as e:
            # roll back so the connection stays usable for the next item
            self.connection.rollback()
            print("Error: %s" % e)
        return item

Don't forget to update settings.py