Scrapy: unable to scrape some information from a website

Date: 2015-11-20 02:10:54

Tags: python xpath web-scraping web-crawler scrapy

I am using Scrapy to crawl and scrape data from the following website:

http://www.glassdoor.com/Job/jobs.htm?suggestCount=4&suggestChosen=true&clickSource=searchBtn&typedKeyword=data+scien&headSiteSrch=%2FJob%2Fjobs.htm&sc.keyword=data+scientist&locT=&locId=

Here are my goals:

  1. Go through every page
  2. On each page, scrape all the result links
  3. Follow each link from #2 and scrape its data

I am able to do all three, but I am stuck on scraping some of the data. For example, here is a link to one of the pages I want to scrape:

    http://www.glassdoor.com/job-listing/lead-data-scientist-director-of-data-science-marketing-cloud-platform-affinity-solutions-JV_IC1147436_KO0,69_KE70,88.htm?jl=1537438396

I can scrape the job title, company name, and location from the top of the page using the following XPaths:

    item['Company'] = response.xpath('//span[@class = "ib"]/text()').extract()
    item['jobTitle'] = response.xpath('//div[@class = "header cell info"]/h2/text()').extract()
    item['Location'] = response.xpath('//span[@class = "subtle ib"]/text()').extract()
    

However, I am unable to scrape any information from the "Company Info" section. Here is my code for Website, Size, Headquarters, and Industry:

    item['Website'] = response.xpath('//div[@id="InfoDetails"]/div[1]/span[@class = "empData website"]/a/@href').extract()
    item['HQ'] = response.xpath('//div[@id="InfoDetails"]/div[2]/span[@class = "empData"]/text()').extract()
    item['Size'] = response.xpath('//div[@id="InfoDetails"]/div[3]/span[@class = "empData"]/text()').extract()
    item['Industry'] = response.xpath('//div[@id="InfoDetails"]/div[6]/span/tt/text()').extract()
    

I don't know why the last four XPaths don't work.

Thanks for your help.

2 Answers:

Answer 0 (score: 0):

Most scraping tools do not render JavaScript. To get the rendered page, you need to use a JavaScript rendering engine. If you are tied to Python, I would recommend using Splash with Scrapy, as discussed here. Other tools, such as PhantomJS, can be integrated with other technologies.
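For context, Splash exposes a simple HTTP API: you ask its `render.html` endpoint for a URL and it returns the page's HTML after JavaScript has run. A sketch of composing such a request URL (the localhost address is an assumption about where your Splash instance runs; no network call is made here):

```python
from urllib.parse import urlencode

# Assumed address of a locally running Splash instance
SPLASH_RENDER = 'http://localhost:8050/render.html'

def splash_render_url(target_url, wait=2.0):
    """Build a Splash render.html request URL that returns the
    JavaScript-rendered HTML of target_url after `wait` seconds."""
    return SPLASH_RENDER + '?' + urlencode({'url': target_url, 'wait': wait})

print(splash_render_url('http://www.glassdoor.com/Job/jobs.htm'))
```

Fetching that URL with any HTTP client yields the rendered page; the scrapy-splash integration in the next answer wraps this same API in `SplashRequest` objects.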

Answer 1 (score: 0):

I know I'm very late, but in case anyone else needs it: Glassdoor generates those attributes dynamically, so I used Splash requests to handle them. Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class GlassdoorData(scrapy.Spider):
    name = 'glassdoordata'
    start_urls = ['https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11.htm']
    page = 1  # current results page, used to build the next-page URL

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 10})

    def parse(self, response):
        # Collect the job links on this results page and follow each one
        urls = response.css('li.jl > div > div.flexbox > div > a::attr(href)').extract()
        for url in urls:
            url = "https://www.glassdoor.ca" + url
            yield SplashRequest(url=url, callback=self.parse_details, args={'wait': 10})

        # Build and follow the next results page
        self.page += 1
        next_page_url = "https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11_IP{}.htm".format(self.page)
        yield SplashRequest(url=next_page_url, callback=self.parse, args={'wait': 10})

    def parse_details(self, response):
        # extract_first(default='') keeps .lstrip() from raising when a field is missing
        yield {
            'Job_Title': response.css('div.header.cell.info > h2::text').extract_first(),
            'Company': response.css('div.header.cell.info > span.ib::text').extract_first(),
            'Location': response.css('div.header.cell.info > span.subtle.ib::text').extract_first(),
            'Website': response.xpath("//div[@class = 'infoEntity']/span/a/text()").extract(),
            'Size': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Size')]/following-sibling::span/text()").extract(),
            'Industry': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Industry')]/following-sibling::span/text()").extract_first(default='').lstrip(),
            'Type': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Type')]/following-sibling::span/text()").extract_first(default='').lstrip(),
            'Revenue': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Revenue')]/following-sibling::span/text()").extract_first(default='').lstrip(),
            'Competitors': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Competitors')]/following-sibling::span/text()").extract_first(default='').lstrip(),
        }

Edit settings.py like this:

BOT_NAME = 'glassdoordata'

SPIDER_MODULES = ['glassdoordata.spiders']
NEWSPIDER_MODULE = 'glassdoordata.spiders'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://192.168.99.100:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

You need to install Splash before running this.
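The usual setup (an assumption about your environment) is to install the scrapy-splash Python package and run the Splash service itself as a Docker container:

```shell
# Install the Scrapy <-> Splash integration package
pip install scrapy-splash

# Run Splash; it listens on port 8050 by default
docker run -p 8050:8050 scrapinghub/splash
```

If Splash runs somewhere other than localhost (e.g. inside a Docker Machine VM, as the `192.168.99.100` address above suggests), point `SPLASH_URL` at that address.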

Thanks.