I am trying to extract job listings from this website. Here is my code:
    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from tutorial.items import DmozItem

    class DmozSpider(Spider):
        name = "myspider"
        allowed_domains = ["tanitjobs.com/"]
        start_urls = ["http://tanitjobs.com/search-results-jobs/"]

        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath('//div[@class="offre"]/div[@class="detail"]')
            items = []
            item = DmozItem()
            for site in sites:
                item['title'] = site.xpath('a/text()').extract()
                item['link'] = site.xpath('a/@href').extract()
                item['desc'] = site.xpath('div[@class="descriptionjob"]/text()').extract()
                items.append(item)
            return items
But the result is wrong (a list of mostly empty items):

    {'desc': [],
     'link': [u'lien'],
     'title': []}

and many more blocks like this one...
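Answer 0 (score: 2)

You need to create the item inside the loop with item = DmozItem(), otherwise you keep overwriting the same item and appending that same object to the items list.

It should look like this: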
    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from tutorial.items import DmozItem

    class DmozSpider(Spider):
        name = "myspider"
        # allowed_domains takes bare domain names, so no trailing slash
        allowed_domains = ["tanitjobs.com"]
        start_urls = ["http://tanitjobs.com/search-results-jobs/"]

        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath('//div[@class="offre"]/div[@class="detail"]')
            items = []
            for site in sites:
                # Create a fresh item for each result instead of reusing one
                item = DmozItem()
                item['title'] = site.xpath('a/text()').extract()
                item['link'] = site.xpath('a/@href').extract()
                item['desc'] = site.xpath('div[@class="descriptionjob"]/text()').extract()
                items.append(item)
            return items
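
To see why reusing one item gives a list of identical entries, here is a minimal sketch in plain Python. It uses ordinary dicts in place of DmozItem, and the function names collect_shared and collect_fresh are made up for illustration; they are not part of Scrapy.

```python
def collect_shared(values):
    # Buggy pattern: one dict is created before the loop and reused.
    item = {}
    items = []
    for value in values:
        item["title"] = value   # overwrites the previous title
        items.append(item)      # appends another reference to the same dict
    return items


def collect_fresh(values):
    # Fixed pattern: a new dict is created on every iteration.
    items = []
    for value in values:
        item = {"title": value}
        items.append(item)
    return items


shared = collect_shared(["a", "b", "c"])
fresh = collect_fresh(["a", "b", "c"])

print([entry["title"] for entry in shared])  # ['c', 'c', 'c']
print([entry["title"] for entry in fresh])   # ['a', 'b', 'c']
```

Every entry in the first list is the same object, so all of them show whatever was written last.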
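Answer 1 (score: 0)

Your title XPath does not account for the <strong> tags around the text, and your desc XPath needs to go down through one more div to reach the information you want.

I also just noticed that the XPath for the job description is not the same for every result. The XPath in the code below returns the job description for the first three results, but not for the later ones. You will need to inspect those later results to see how the XPath has to change to retrieve their descriptions.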
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="offre"]/div[@class="detail"]')
        items = []
        for site in sites:
            item = DmozItem()
            # The title text sits inside <strong>; normalize-space() trims whitespace
            item['title'] = site.xpath('normalize-space(a/strong/text())').extract()
            item['link'] = site.xpath('a/@href').extract()
            # The description is nested one div deeper than in the question's XPath
            item['desc'] = site.xpath('normalize-space(./div/div[@class="descriptionjob"]/text())').extract()
            items.append(item)
        return items
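
One possible way to cope with the description sitting at different depths is to search all descendants rather than a fixed path. The sketch below only illustrates the idea with the standard library's ElementTree on made-up markup (SAMPLE, clean and parse_jobs are hypothetical names, and the real page structure may differ). In Scrapy the corresponding expression would be something like site.xpath('.//div[@class="descriptionjob"]//text()'), which I have not checked against the actual site.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the description is nested at two different depths.
SAMPLE = """
<div class="offre">
  <div class="detail">
    <a href="/job/1"><strong> Developer </strong></a>
    <div><div class="descriptionjob"> Build things </div></div>
  </div>
  <div class="detail">
    <a href="/job/2"><strong>Tester</strong></a>
    <div class="descriptionjob">Break   things</div>
  </div>
</div>
"""


def clean(text):
    # Collapse runs of whitespace, similar to XPath normalize-space().
    return " ".join(text.split())


def parse_jobs(markup):
    root = ET.fromstring(markup)
    jobs = []
    for detail in root.findall("./div[@class='detail']"):
        link = detail.find("a")
        # ".//" searches every descendant, so nesting depth does not matter.
        desc = detail.find(".//div[@class='descriptionjob']")
        jobs.append({
            # itertext() also picks up text inside nested tags such as <strong>.
            "title": clean("".join(link.itertext())),
            "link": link.get("href"),
            "desc": clean("".join(desc.itertext())) if desc is not None else "",
        })
    return jobs


jobs = parse_jobs(SAMPLE)
for job in jobs:
    print(job)
```

Both sample entries yield a title and a description even though the description div is nested differently in each.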