I'm relatively new to Scrapy and I've been getting a lot of exceptions... Here's what I'm trying to do:
There are 4 nested links I want to pull data from. Say there are 5 items I want to scrape. The items are
Industry=scrapy.Field()
Company=scrapy.Field()
Contact_First_name=scrapy.Field()
Contact_Last_name=scrapy.Field()
Website=scrapy.Field()
Using the href link, I then get to another page that contains Contact_First_Name and Contact_Last_Name.
After crawling all of these pages, I should have items that look somewhat like this:
Industry     Company   Website    Contact_First_Name   Contact_Last_Name
Finance      JPMC      JP.com     Jamie                Dimon
Finance      BOA       BOA.com    Bryan                Moynihan
Technology   ADSK      ADSK.com   Carl                 Bass
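In Scrapy, fields like these are declared on a `scrapy.Item` subclass in `items.py`. As a rough, Scrapy-free sketch of how such an item behaves (a plain dict restricted to declared fields; the `FieldRestrictedItem` base class here is a hypothetical stand-in, not Scrapy's actual implementation):

```python
# Hypothetical stand-in for the question's PschamberItem (the real one
# subclasses scrapy.Item). It mimics only the dict-like behavior the
# spider relies on: keys must be declared fields, values are set like
# ordinary dict entries.
class FieldRestrictedItem(dict):
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class PschamberItem(FieldRestrictedItem):
    fields = ("Industry", "Company", "Website",
              "Contact_First_Name", "Contact_Last_Name")

item = PschamberItem()
item["Industry"] = "Finance"
item["Company"] = "JPMC"
print(dict(item))  # {'Industry': 'Finance', 'Company': 'JPMC'}
```

Assigning to an undeclared key (say `item["Ticker"]`) raises a `KeyError`, which is the same safety net a real `scrapy.Item` gives you.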
EDITED
Below is the working code. Anzel's suggestions definitely helped, but I realized the allowed_domains in my subclass was wrong, which was preventing the nested links from being followed. Once I changed it, it worked.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    allowed_domains = ["cm.pschamber.com"]
    start_urls = ["http://cm.pschamber.com/list/"]

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//*[@id="mn-ql"]/ul/li/a'):
            # xpath() and xpath().extract() return a list;
            # extract()[0] returns the first element of that list
            item['Industry'] = sel.xpath('text()').extract()
            # scrapy.Request takes a single url string, not a list of hrefs,
            # and the Request object (not the item) is what must be yielded
            yield scrapy.Request(sel.xpath('@href').extract()[0],
                                 callback=self.parse_2, meta={'item': item})

    def parse_2(self, response):
        # iterate over the selectors themselves, not over extract()'ed strings
        for sel in response.xpath('.//*[@id="mn-members"]/div/div/div/div/div/a'):
            # the item passed along in meta is available on the response
            item = response.meta['item']
            item['Company'] = sel.xpath('text()').extract()
            # again, yield the Request object
            yield scrapy.Request(sel.xpath('@href').extract()[0],
                                 callback=self.parse_3, meta={'item': item})

    def parse_3(self, response):
        item = response.meta['item']
        # note the * after // ; .//[@id=...] is not valid XPath
        item['Website'] = response.xpath('.//*[@id="mn-memberinfo-block-website"]/a/@href').extract()
        # done following links, so just return the finished item
        return item
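The flow of the spider above (item created in `parse`, enriched in `parse_2`, finished in `parse_3`) can be sketched without Scrapy as a chain of callbacks handing a `meta` dict forward. The page data below is invented for illustration; in the real spider each callback runs on a downloaded response rather than being called directly:

```python
# Simplified stand-in for Scrapy's request/callback chain: each "request"
# is just a (callback, meta) pair invoked immediately, and made-up values
# stand in for the scraped pages.
def parse():
    item = {"Industry": "Finance"}
    # pass the partially-filled item forward, as Request(meta={'item': item}) would
    return parse_2({"item": item})

def parse_2(meta):
    item = meta["item"]
    item["Company"] = "JPMC"
    return parse_3({"item": item})

def parse_3(meta):
    item = meta["item"]
    item["Website"] = "JP.com"
    return item  # the finished item, with fields from all three callbacks

print(parse())  # {'Industry': 'Finance', 'Company': 'JPMC', 'Website': 'JP.com'}
```

The key point is that all three callbacks mutate the same dict, so the item that `parse_3` returns carries everything collected along the way.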
Answer 0: (score: 1)
You made quite a few mistakes in your code, which is why it didn't run as expected. See my brief example below for how to get the items you need and pass meta on to the other callbacks. I haven't copied your XPath expressions; I just took the most straightforward ones from the site, and you can substitute your own.
I've commented as explicitly as possible so you can see where you went wrong.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    # start from this; since your domain is a sub-domain on its own,
    # you need to change to this, without the http://
    allowed_domains = ["cm.pschamber.com"]
    start_urls = (
        'http://cm.pschamber.com/list/',
    )

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//div[@id="mn-ql"]//a'):
            # xpath() and xpath().extract() will return a list;
            # extract()[0] will return the first element in the list
            item['industry'] = sel.xpath('text()').extract()[0]
            # another mistake you made here:
            # you're trying to call scrapy.Request(LIST of hrefs), which will fail,
            # since scrapy.Request only takes a url string, not a list.
            # another big mistake is you're trying to yield the item,
            # whereas you should yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_2,
                meta={'item': item}
            )

    # another mistake: your callback function DOESN'T take item as an argument
    def parse_2(self, response):
        for sel in response.xpath('//div[@class="mn-title"]//a'):
            # you can access your response meta like this
            item = response.meta['item']
            item['company'] = sel.xpath('text()').extract()[0]
            # again, yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_3,
                meta={'item': item}
            )

    def parse_3(self, response):
        item = response.meta['item']
        item['website'] = response.xpath('//a[@class="mn-print-url"]/text()').extract()
        # OK, finally, assume you're done; just return the item object
        return item
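The allowed_domains point matters because Scrapy's offsite filter silently drops any request whose host is not a listed domain or a subdomain of one, which is exactly why the nested links weren't being followed. A rough standard-library approximation of that check (a sketch of the idea, not Scrapy's actual OffsiteMiddleware code):

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    # Simplified version of the offsite check: the request host must equal
    # an allowed domain exactly, or end with "." + domain (be a subdomain).
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_allowed("http://cm.pschamber.com/list/member/abc", ["cm.pschamber.com"]))  # True
print(is_allowed("http://cm.pschamber.com/list/", ["pschamber.com"]))               # True
print(is_allowed("http://othersite.com/", ["cm.pschamber.com"]))                    # False
```

With `allowed_domains = ["somethingelse.com"]`, every follow-up request to cm.pschamber.com would fail this check and be filtered before any callback ran.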
Hopefully this is self-explanatory. To grasp the basic principles of Scrapy, you should read the Scrapy documentation thoroughly; you'll soon learn another way to set up rules that follow links matching certain patterns. Of course, once you get the basics right, you'll understand those too.
Although everyone's journey is different, I strongly recommend you keep reading and practicing until you're confident in what you're doing before scraping real-world sites. Also, there are rules governing what page content may be scraped, and copyright applies to the content you scrape.
Keep that in mind, or you may run into serious trouble down the road. Anyway, good luck, and I hope this answer helps you solve the problem!