I'm new to Python and Scrapy, and so far this site has been an invaluable resource for my project, but now I'm stuck on a problem that seems like it should be fairly simple. I may be thinking about it the wrong way. What I want to do is add a column to my output CSV that lists the URL each row of data was scraped from. In other words, I want the table to look like this:
item1    item2    item_url
a        1        http://url/a
b        2        http://url/a
c        3        http://url/b
d        4        http://url/b
I'm using psycopg2 to fetch a bunch of URLs stored in a database, which I then scrape from. The code looks like this:
class MySpider(CrawlSpider):
    name = "spider"

    # querying the database here...
    # getting the urls from the database and assigning them to the rows list
    rows = cur.fetchall()

    allowed_domains = ["www.domain.com"]

    start_urls = []
    for row in rows:
        # adding the urls from rows to start_urls
        start_urls.append(row)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")
        items = []
        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # here is my non-working code
            item['url_item'] = row
            items.append(item)
        return items
As you can see, I want to create an item that simply takes the URL the parse function is currently on. But when I run the spider, it gives me "exceptions.NameError: global name 'row' is not defined." I think this is because Python doesn't recognize row as a variable within the XPathSelector function, or something like that? (Like I said, I'm new.) Either way, I'm stuck, and any help would be much appreciated.
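For context, the cur in the snippet above would come from a psycopg2 setup along these lines; this is only a minimal sketch, and the connection parameters, table, and column names are placeholders:

import psycopg2

# Placeholder connection details and query.
conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="me", password="secret")
cur = conn.cursor()
cur.execute("SELECT url FROM urls")
rows = cur.fetchall()  # a list of 1-tuples, e.g. [('http://url/a',), ('http://url/b',)]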
Answer 0 (score: 2)
Don't generate the start requests in the class body; do it in start_requests():
class MySpider(CrawlSpider):
    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # querying the database here...
        # getting the urls from the database and assigning them to the rows list
        rows = cur.fetchall()
        for url, ... in rows:
            yield self.make_requests_from_url(url)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")
        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # the fix: take the URL the response was fetched from
            item['url_item'] = response.url
            yield item
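For a self-contained version, here is a minimal sketch of how the pieces fit together, written against the modern Scrapy API (scrapy.Request and response.xpath rather than the since-deprecated make_requests_from_url and the old HtmlXPathSelector). The connection parameters, query, XPaths, and the items module path are all placeholders:

import psycopg2
import scrapy

from myproject.items import SettingsItem  # hypothetical items module

class MySpider(scrapy.Spider):  # a plain Spider suffices when parse() is overridden
    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # Placeholder connection details and query.
        conn = psycopg2.connect(host="localhost", dbname="mydb",
                                user="me", password="secret")
        cur = conn.cursor()
        cur.execute("SELECT url FROM urls")
        for (url,) in cur.fetchall():  # fetchall() returns 1-tuples
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for site in response.xpath("//div[@class='row']"):  # placeholder XPath
            item = SettingsItem()
            # ... extract the other fields with site.xpath(...) here ...
            item['url_item'] = response.url  # the page this row was scraped from
            yield item

Running scrapy crawl spider -o output.csv then writes the items, including the url_item column, to the CSV.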