How do I assign the URL currently being scraped to an item?

Asked: 2012-08-20 05:17:54

Tags: python scrapy

I'm very new to Python and Scrapy, and so far this site has been an invaluable resource for my project, but now I'm stuck on what seems like a very simple problem. I may be thinking about it the wrong way. What I want to do is add a column to my output CSV that lists the URL each row of data was scraped from. In other words, I want the table to look like this:

item1    item2    item_url
a        1        http://url/a
b        2        http://url/a
c        3        http://url/b
d        4        http://url/b    

I'm using psycopg2 to get a bunch of URLs stored in a database, which I then scrape from. The code looks like this:

class MySpider(CrawlSpider):
    name = "spider"

    # querying the database here...

    #getting the urls from the database and assigning them to the rows list
    rows = cur.fetchall()

    allowed_domains = ["www.domain.com"]

    start_urls = []

    for row in rows:

        #adding the urls from rows to start_urls
        start_urls.append(row)

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("a bunch of xpaths here...")
            items = []
            for site in sites:
                item = SettingsItem()
                # a bunch of items and their xpaths...
                # here is my non-working code
                item['url_item'] = row
                items.append(item)
            return items
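For completeness, SettingsItem would be an ordinary scrapy.Item subclass with a Field per column, including the new URL column. A sketch, where every field name except url_item is an illustrative assumption:

from scrapy.item import Item, Field

class SettingsItem(Item):
    # stand-ins for "a bunch of items" scraped from each page
    item1 = Field()
    item2 = Field()
    # the field meant to hold the page's source URL
    url_item = Field()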

As you can see, I want to make an item that just takes the URL the parse function is currently on. But when I run the spider, it gives me "exceptions.NameError: global name 'row' is not defined." I think this is because Python doesn't recognize row as a variable within the XPathSelector function, or something like that? (Like I said, I'm new.) Anyway, I'm stuck, and any help would be greatly appreciated.
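A side note on the error itself: names bound in a class body, such as the loop variable row, are skipped when Python resolves free variables inside a method, so row no longer exists by the time parse() is called. A minimal sketch with hypothetical names that reproduces the same NameError:

class Demo(object):
    for row in ["http://url/a", "http://url/b"]:
        def method(self):
            # class-body names are not in the lookup chain for free
            # variables inside methods, so this falls through to globals
            return row

Demo().method()  # NameError: global name 'row' is not defined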

1 Answer:

Answer 0 (score: 2):

Put the start-request generation not in the class body, but in start_requests():

# imports assumed for this sketch (the 0.x-era Scrapy API that matches HtmlXPathSelector)
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(CrawlSpider):

    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # querying the database here...

        #getting the urls from the database and assigning them to the rows list
        rows = cur.fetchall()

        for row in rows:
            # assumption: the URL is the first column of each fetched row
            url = row[0]
            yield self.make_requests_from_url(url)


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")

        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # the fix: take the URL straight from the response being parsed
            item['url_item'] = response.url

            yield item
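A usage note on the fix: response.url is the address of the page that produced the response (after any redirects), so every item yielded while parsing a page carries that page's URL. Running the spider with Scrapy's built-in CSV feed exporter then produces exactly the table from the question; the command below assumes the 0.x-era flags and an output file name:

scrapy crawl spider -o output.csv -t csv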