Scrapy Getting Start_Urls

Date: 2015-01-24 01:39:21

Tags: python mysql scrapy web-crawler pipeline

OK, I'll keep this short since I have to rush off to a meeting.

I am trying to get the start URL in Scrapy, and no matter what I try, I can't seem to manage it. Here is my code (spider):

import scrapy
import csv

from scrapycrawler.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["snipplr.com"]


    def start_requests(self):
        for i in range(1, 230):
            yield self.make_requests_from_url("http://www.snipplr.com/view/%d" % i)

    def make_requests_from_url(self, url):
        item = DmozItem()

        # assign url
        item['link'] = url
        request = Request(url, dont_filter=True)

        # set the meta['item'] to use the item in the next callback
        request.meta['item'] = item
        return request

    # Rules only apply before
    rules = (
        Rule(LxmlLinkExtractor(deny=(r'http://www\.snipplr\.com/snippet-not-found/',)), callback="parse", follow=True),
    )

    def parse(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item['title'] = sel.xpath('//div[@class="post"]/h1/text()').extract()
        # start_url
        item['link'] = response.url
        return item
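The meta-passing pattern used above can be sketched without Scrapy at all: `meta` is just a dict that rides along with the request, so whatever `make_requests_from_url` stores in it comes back out in `parse`. A minimal stand-in (using `SimpleNamespace` in place of a real `Request`/`Response`, purely for illustration):

```python
from types import SimpleNamespace

def make_request(url):
    # Stand-in for scrapy.http.Request: the item rides along in meta.
    item = {"link": url}
    return SimpleNamespace(url=url, meta={"item": item})

def parse(response):
    # Stand-in for the spider callback: recover the item from meta.
    item = response.meta["item"]
    item["link"] = response.url  # response.url is a plain string
    return item

req = make_request("http://www.snipplr.com/view/1")
item = parse(req)
print(item["link"])  # the full URL string
```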

I have tried everything, and so far all I get is an "h" in the url column of my database.


Here is my database pipeline:

import csv
from scrapy.exceptions import DropItem
from scrapy import log
import sys
import mysql.connector

class CsvWriterPipeline(object):

    def __init__(self):
        self.connection = mysql.connector.connect(host='localhost', user='ws', passwd='ps', db='ws')
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("SELECT title,url FROM items WHERE title= %s", item['title'])
        result = self.cursor.fetchone()
        if result:
            log.msg("Item already in database: %s" % item, level=log.DEBUG)
        else:
            self.cursor.execute(
                "INSERT INTO items (title, url) VALUES (%s, %s)",
                (item['title'][0], item['link'][0]))
            self.connection.commit()
            log.msg("Item stored: %s" % item, level=log.DEBUG)
        return item

    def handle_error(self, e):
        log.err(e)
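One convention worth noting in the pipeline above: DB-API drivers such as `mysql.connector` expect query parameters as a sequence (a tuple or list), even when there is only one value. A minimal sketch of the same insert-then-select round trip using the stdlib `sqlite3` driver (which follows the same DB-API parameter convention, with `?` placeholders instead of `%s`; the table and values are illustrative):

```python
import sqlite3

# In-memory database standing in for the MySQL "items" table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE items (title TEXT, url TEXT)")

# Parameters always go in a tuple, even a single one: ("Example",)
cur.execute("INSERT INTO items (title, url) VALUES (?, ?)",
            ("Example", "http://www.snipplr.com/view/1"))
conn.commit()

row = cur.execute("SELECT url FROM items WHERE title = ?",
                  ("Example",)).fetchone()
print(row[0])  # the full URL, not just "h"
```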

As you can see here, it is clearly working.

How do I get the start URL, or how would I go about fetching it? I believe the "h" means the field is empty. The database is MySQL.
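For what it's worth, that "h" is unlikely to mean an empty field. The stored value is built with `item['link'][0]`, and since the link is a plain string, indexing it with `[0]` takes its first character:

```python
# item['link'] holds a URL string; [0] on a string yields one character.
url = "http://www.snipplr.com/view/1"
first = url[0]
print(first)  # "h" -- not an empty field, but the first letter of "http"
```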

Thank you for reading and for your help.

Regards, Charlie

1 Answer:

Answer 0 (score: 1):

item['link'], as opposed to item['title'], is just a string, not a list:

self.cursor.execute("INSERT INTO items (title, url) VALUES (%s, %s)",
                    (item['title'][0], item['link']))