Scrapy - getting the output on new lines

Date: 2017-08-27 06:48:59

Tags: python web-scraping scrapy

The output looks like this: [image]

because the HTML looks like this: [image]

I can't separate the data. Can anyone tell me how to do this?

Here is my code:

# -*- coding: utf-8 -*-
import scrapy


class MonsterComSpider(scrapy.Spider):
    name = 'monsterca'
    #allowed_domains = ['www.monster.ca']
    start_urls = ['https://www.monster.ca/jobs/search/?q=data-analyst&page=1']

    def parse(self, response):
        urls = response.css('div.jobTitle > h2 > a::attr(href)').extract()

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_details)

        # crawling all the pages
        next_page_url = response.xpath('//head/link[@rel="next"]/@href').extract_first()

        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        if response.css('div[id = JobDescription] > span[id = TrackingJobBody] > ul'):
            yield {
                'Job Post': response.css('div.opening.col-sm-12 > h1::text').extract_first(),
                'Location': response.css('div.opening.col-sm-12 > h2::text').extract_first(),
                'Description': response.css('div[id = JobDescription] > span[id = TrackingJobBody] > ul > li::text').extract()
            }
        elif response.css('div[id = JobDescription] > span[id = TrackingJobBody]'):
            yield {
                'Job Post': response.css('div.opening.col-sm-12 > h1::text').extract_first(),
                'Location': response.css('div.opening.col-sm-12 > h2::text').extract_first(),
                'Description': response.css('div[id = JobDescription] > span[id = TrackingJobBody]::text').extract()
            }

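I added the if/else because monster.ca uses different layouts for different pages and I want to standardize the output. For this question, please consider the elif case.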

Here is the link I am scraping: http://job-openings.monster.ca/Senior-Data-Analyst-Calgary-AB-CA-Precision-ERP/11/186139327?MESCOID=1300087001001&jobPosition=2

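Alternatively, could someone tell me how to remove those special characters from the output and put the text that follows each special character on a new line? Thanks.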

1 answer:

Answer 0 (score: 1)

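I don't like using CSS in cases like this; I would rather use XPath to get the text. Here are two possible solutions. With CSS, take the text of every element nested inside the span and join the pieces with newlines: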

'Description' : "\n".join(response.css('div[id = JobDescription] > span[id = TrackingJobBody] *::text').extract())

With XPath, which also picks up text sitting directly inside the span, I would use:

'Description' : "\n".join(response.css('div[id = JobDescription] > span[id = TrackingJobBody]').xpath(".//text()").extract())
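Joining with "\n" puts each text node on its own line, but it does not remove the special characters the question mentions. Below is a minimal sketch of a post-processing step, under some assumptions: the helper name clean_description is hypothetical, and the set of characters in BULLETS is a guess (bullet-like markers and non-breaking spaces), since the screenshot of the output is not available. Adjust it to whatever actually appears in your data.

```python
import re

# Assumed set of bullet-like markers; replace with the characters
# that actually show up in the scraped output.
BULLETS = "•·●▪■◦‣"
_BULLET_SPLIT = re.compile("[" + re.escape(BULLETS) + "]")


def clean_description(fragments):
    """Turn a list of extracted text nodes into one newline-separated string."""
    lines = []
    for fragment in fragments:
        # Non-breaking spaces often appear as \xa0 in scraped text.
        fragment = fragment.replace("\xa0", " ")
        # One text node may hold several bulleted items; split them apart.
        for piece in _BULLET_SPLIT.split(fragment):
            # Collapse runs of whitespace (including \r\n) into single spaces.
            piece = " ".join(piece.split())
            if piece:  # skip fragments that were only whitespace or a bullet
                lines.append(piece)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ["\r\n• Build dashboards\xa0and reports ", "· Write SQL queries\r\n", "   "]
    print(clean_description(sample))
```

In the spider, the list returned by .extract() could be passed straight to this helper in place of the "\n".join(...) call, for example 'Description' : clean_description(response.css('div[id = JobDescription] > span[id = TrackingJobBody]').xpath(".//text()").extract()).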