Updating a database table from Scrapy using Request meta

Time: 2016-03-26 15:27:22

Tags: python mysql scrapy

This is my first question on Stack Overflow. I'm experimenting with Scrapy, and once Scrapy has fetched a link I want to mark that link's row in the database with scanned = 1.

# -*- coding: utf-8 -*-
import scrapy
import scrapy.http
from scrapy.spiders import CrawlSpider, Rule
from Testing.items import Testing100Item
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Response
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.responsetypes import Response
import re
import MySQLdb
from MySQLdb.cursors import SSCursor
import MySQLdb.cursors

##This is the connector to Database to Read New Domains
def getdomainsfromdb():
    try:
        conn = MySQLdb.connect(
            host="localhost",
            user="root",
            passwd="root",
            db="Testing",
            cursorclass = MySQLdb.cursors.SSCursor)
        cursor = conn.cursor()
        query = """
                SELECT domain_id, url, id_sitemap_links
                from Sitemap_links
                where scanned = 0;"""
        cursor.execute(query)
        return cursor.fetchall()
    except Exception, e:
        print e

##This will update the scanned to 1
def scanned(id_sitemap_links):
    try:
        conn = MySQLdb.connect(
            host="localhost",
            user="root",
            passwd="root",
            db="Testing",
            cursorclass = MySQLdb.cursors.SSCursor)
        cursor = conn.cursor()
        query = """
            UPDATE Sitemap_links
            set scanned = 1
            where id_sitemap_links = '%s' """
        cursor.execute(query, (int(id_sitemap_links),))
    except Exception, e:
        print e

class Testing100Spider(scrapy.Spider):
    name = "testing100"
    #allowed_domains = []
    #start_urls = ()

    def start_requests(self):
        for domain_id, url, id_sitemap_links in getdomainsfromdb():
            yield Request(url, callback=self.parse, meta={'id_sitemap_links': id_sitemap_links})

    def parse(self, response):

        # domain_id = response.meta['domain_id']
        id_sitemap_links = response.meta['id_sitemap_links']
        scanned(id_sitemap_links)
        print id_sitemap_links



        # def parse(self, response):
        #     domain_id = Request(0)
        #     item = Testing100Item()
        #     #items = []

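At this point I can read the domains with getdomainsfromdb(), but I can't update the row for the link Scrapy is processing. I can print id_sitemap_links, but the SQL UPDATE has no effect.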

What am I missing here? Thanks in advance.

1 answer:

Answer 0: (score: 1)

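A few things to fix:

  • Remove enter code here from the query (though that is probably a posting error)
  • Remove the quotes around the %s placeholder; the driver quotes parameters itself
  • Add conn.commit(); MySQLdb does not autocommit by default, so on a transactional engine such as InnoDB the UPDATE is discarded unless it is committed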
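To see why the missing commit matters, here is a small sketch that uses the standard-library sqlite3 module as a stand-in for MySQL (an assumption made purely so the example runs without a server; note that sqlite3 uses ? placeholders where MySQLdb uses %s). The helper name scanned_flag_after_update is hypothetical. It runs the UPDATE on one connection, closes it, and reports what a fresh connection sees.

```python
import os
import sqlite3
import tempfile


def scanned_flag_after_update(commit):
    """Run the UPDATE on one connection, then read the flag from a new one."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # One-row table standing in for Sitemap_links.
    setup = sqlite3.connect(path)
    setup.execute(
        "CREATE TABLE Sitemap_links "
        "(id_sitemap_links INTEGER PRIMARY KEY, scanned INTEGER)"
    )
    setup.execute("INSERT INTO Sitemap_links VALUES (1, 0)")
    setup.commit()
    setup.close()

    writer = sqlite3.connect(path)
    writer.execute(
        "UPDATE Sitemap_links SET scanned = 1 WHERE id_sitemap_links = ?", (1,)
    )
    if commit:
        writer.commit()
    # Closing with an open transaction rolls the pending UPDATE back.
    writer.close()

    reader = sqlite3.connect(path)
    row = reader.execute(
        "SELECT scanned FROM Sitemap_links WHERE id_sitemap_links = 1"
    ).fetchone()
    reader.close()
    return row[0]
```

Calling scanned_flag_after_update(commit=False) leaves the flag at 0, while commit=True persists the 1. The same pattern applies to the scanned() function in the question: the statement executes without error, but nothing is ever committed.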

Fixed version:

conn = MySQLdb.connect(
    host="localhost",
    user="root",
    passwd="root",
    db="Testing",
    cursorclass = MySQLdb.cursors.SSCursor)
cursor = conn.cursor()
query = """
    UPDATE Sitemap_links
    set scanned = 1
    where id_sitemap_links = %s """
cursor.execute(query, (int(id_sitemap_links), ))
conn.commit()
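One possible way to package the fix is a helper that receives an already-open connection instead of reconnecting on every call. This is only a sketch: the name mark_scanned, the UPDATE_SCANNED constant, and the choice to roll back and re-raise on failure are assumptions, not part of the original answer. It relies only on the standard DB-API methods cursor(), commit() and rollback(), which MySQLdb connections provide.

```python
UPDATE_SCANNED = """
    UPDATE Sitemap_links
    SET scanned = 1
    WHERE id_sitemap_links = %s"""


def mark_scanned(conn, id_sitemap_links):
    """Set scanned = 1 for one row and commit; roll back if the UPDATE fails."""
    cursor = conn.cursor()
    try:
        # The driver substitutes and quotes the parameter, so %s stays unquoted.
        cursor.execute(UPDATE_SCANNED, (int(id_sitemap_links),))
        conn.commit()
    except Exception:
        # Undo the failed statement and let the caller see the error
        # instead of printing and swallowing it.
        conn.rollback()
        raise
    finally:
        cursor.close()
```

Re-raising instead of printing the exception means a failed UPDATE shows up in the Scrapy log with a traceback rather than passing silently.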

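Note that it is generally recommended to put database-specific functionality in an item pipeline rather than directly in the spider.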
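A rough sketch of what such a pipeline could look like follows. The class name MarkScannedPipeline and the connect argument (a hook for supplying a connection factory, mainly useful for testing) are hypothetical; open_spider, process_item and close_spider are the standard Scrapy item pipeline methods as of the Scrapy versions current when this question was asked. It assumes the spider's parse() yields an item (a plain dict works) carrying id_sitemap_links.

```python
class MarkScannedPipeline(object):
    """Item pipeline that flags each crawled link as scanned."""

    def __init__(self, connect=None):
        # `connect` is a hypothetical hook: a zero-argument callable that
        # returns a DB-API connection. Scrapy itself passes no arguments.
        self._connect = connect
        self.conn = None

    def open_spider(self, spider):
        if self._connect is None:
            # Imported lazily so the module loads even without the driver.
            import MySQLdb
            self._connect = lambda: MySQLdb.connect(
                host="localhost", user="root", passwd="root", db="Testing")
        # One connection for the whole crawl instead of one per response.
        self.conn = self._connect()

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        cursor.execute(
            "UPDATE Sitemap_links SET scanned = 1 WHERE id_sitemap_links = %s",
            (int(item["id_sitemap_links"]),))
        self.conn.commit()  # without this the UPDATE is never persisted
        cursor.close()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

With this in place, parse() would reduce to yielding {'id_sitemap_links': response.meta['id_sitemap_links']}, and the pipeline would be enabled through the ITEM_PIPELINES setting, for example {'Testing.pipelines.MarkScannedPipeline': 300}, where the module path is a guess based on the project name in the question.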