I am using Scrapy to crawl and scrape data from the following website:
Here are my goals:
I am able to do all three, but I am stuck scraping some of the data. For example, here is a link to the page I want to scrape:
I can scrape the job title, company name, and location from the top of the page using the following XPaths:
item['Company'] = response.xpath('//span[@class = "ib"]/text()').extract()
item['jobTitle'] = response.xpath('//div[@class = "header cell info"]/h2/text()').extract()
item['Location'] = response.xpath('//span[@class = "subtle ib"]/text()').extract()
However, I am unable to scrape anything from the "Company Info" section. Here is my code for Website, Size, Headquarters, and Industry:
item['Website'] = response.xpath('//div[@id="InfoDetails"]/div[1]/span[@class = "empData website"]/a/@href').extract()
item['HQ'] = response.xpath('//div[@id="InfoDetails"]/div[2]/span[@class = "empData"]/text()').extract()
item['Size'] = response.xpath('//div[@id="InfoDetails"]/div[3]/span[@class = "empData"]/text()').extract()
item['Industry'] = response.xpath('//div[@id="InfoDetails"]/div[6]/span/tt/text()').extract()
I have no idea why these last four XPaths do not work.
Thanks for your help.
Answer 0 (score: 0)
Most scraping tools do not render JavaScript. To get fully rendered pages, you need a JavaScript rendering engine. If you are tied to Python, I suggest using Splash with Scrapy, as discussed here. Other tools, such as PhantomJS, can be integrated into other stacks.
Answer 1 (score: 0)
I know I am very late, but in case someone else needs it: Glassdoor generates these attributes dynamically, so I used Splash requests to handle them. Here is the code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

id = 1

class GlassdoorData(scrapy.Spider):
    name = 'glassdoordata'
    #allowed_domains = ['https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11.htm']
    start_urls = ['https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11.htm']

    def start_requests(self):
        # Route each start URL through Splash so the JavaScript runs
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 10},
            )

    def parse(self, response):
        # Collect the job-detail links from the rendered listing page
        urls = response.css('li.jl > div > div.flexbox > div > a::attr(href)').extract()
        for url in urls:
            url = "https://www.glassdoor.ca" + url
            yield SplashRequest(url=url, callback=self.parse_details, args={'wait': 10})

        # Build the next listing page URL from a module-level page counter
        global id
        id = id + 1
        next_page_url = "https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11_IP{}.htm".format(id)
        yield SplashRequest(url=next_page_url, callback=self.parse, args={'wait': 10})

    def parse_details(self, response):
        # default='' guards against extract_first() returning None,
        # which would make the chained .lstrip() raise AttributeError
        yield {
            'Job_Title': response.css('div.header.cell.info > h2::text').extract_first(),
            'Company': response.css('div.header.cell.info > span.ib::text').extract_first(),
            'Location': response.css('div.header.cell.info > span.subtle.ib::text').extract_first(),
            'Website': response.xpath("//div[@class = 'infoEntity']/span/a/text()").extract(),
            'Size': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Size')]/following-sibling::span/text()").extract(),
            'Industry': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Industry')]/following-sibling::span/text()").extract_first(default='').lstrip(),
            'Type': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Type')]/following-sibling::span/text()").extract_first(default='').lstrip(),
            'Revenue': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Revenue')]/following-sibling::span/text()").extract_first(default='').lstrip(),
            'Competitors': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Competitors')]/following-sibling::span/text()").extract_first(default='').lstrip(),
        }
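A note on the chained extract_first(...).lstrip() pattern in parse_details: extract_first() returns None whenever a selector matches nothing (which happens on job pages that omit a field), so the strip has to be guarded, either with extract_first(default='') or with a small helper. A minimal sketch of such a guard; clean_first is a hypothetical name, not part of the original spider:

```python
def clean_first(values):
    """Return the first extracted string with leading whitespace
    stripped, or None when the selector matched nothing.

    `values` mimics the list that SelectorList.extract() returns.
    """
    first = values[0] if values else None
    return first.lstrip() if first is not None else None

print(clean_first(["  Computer Software"]))  # -> Computer Software
print(clean_first([]))                       # -> None
```

Either approach keeps the spider from crashing with AttributeError on a page where one of the labels (Size, Industry, Type, Revenue, Competitors) is absent.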
Edit settings.py like this:
BOT_NAME = 'glassdoordata'

SPIDER_MODULES = ['glassdoordata.spiders']
NEWSPIDER_MODULE = 'glassdoordata.spiders'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://192.168.99.100:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False
You need to install Splash before running this.
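For reference, Splash is typically run as a Docker container, and the SPLASH_URL in settings.py must point at wherever it listens (192.168.99.100 is a common Docker Machine address on Windows/macOS; with a native Docker install it is usually http://localhost:8050). Assuming Docker is available, one way to start it:

```shell
# Pull the Splash image and run the rendering service on port 8050
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
```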
Thanks