School assignment.
I am writing a custom spider to extract multiple items from a page. The idea is to extract the job role, company, and location from https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab
I am trying to follow https://www.accordbox.com/blog/scrapy-tutorial-10-how-build-real-spider/ to build a spider for a different site.
Here is the code I am using. I am really not sure where else to make changes:
class JobDetail(Item):
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()

class JobItems(Spider):
    name = 'JobItems'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/jobs']

    def parse(self, response):
        yield Request('https://stackoverflow.com/jobs', callback=self.parse_details)

    def parse_details(self, response):
        jobs = response.xpath('//div[@class="-job-summary"]')
        for job in jobs:
            job = JobDetail()
            job['title'] = job.xpath('.//*[@class="s-link s-link__visited"]').extract()
            job['company'] = job.xpath('.//div[@class="fc-black-700 fs-body2 -company"]//span[1]/text()').extract()
            job['location'] = job.xpath('.//div[@class="fc-black-700 fs-body2 -company"]//span[2]/text()').extract()
            yield jobs
The code above scrapes nothing.
class JobDetail(Item):
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()

class JobItems(Spider):
    name = 'JobItems'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/jobs']

    def parse(self, response):
        yield Request('https://stackoverflow.com/jobs', callback=self.parse_details)

process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 \
    (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'})

def parse_details(self, response):
    jobs = response.xpath('//div[@class="-job-summary"]')
    for job in jobs:
        job = JobDetail()
        job['title'] = job.xpath('.//*[@class="s-link s-link__visited"]').extract()
        job['company'] = job.xpath('.//div[@class="fc-black-700 fs-body2 -company"]//span[1]/text()').extract()
        job['location'] = job.xpath('.//div[@class="fc-black-700 fs-body2 -company"]//span[2]/text()').extract()
        yield jobs

process.crawl(JobItems)
Zero items crawled with the code above:
2019-07-06 20:47:06 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-07-06 20:47:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Linux-4.14.79+-x86_64-with-Ubuntu-18.04-bionic
2019-07-06 20:47:06 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 \\\n (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'}
2019-07-06 20:47:06 [scrapy.extensions.telnet] INFO: Telnet Password: dc701a1b667b9026
2019-07-06 20:47:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2019-07-06 20:47:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-06 20:47:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-06 20:47:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-07-06 20:47:06 [scrapy.core.engine] INFO: Spider opened
2019-07-06 20:47:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-06 20:47:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6028
<Deferred at 0x7f595e7570f0>
Made the changes as per @abdusco's suggestion - same output.
class JobDetail(Item):
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()

class JobItems(Spider):
    name = 'JobItems'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/jobs']

    def parse_details(self, response):
        jobs = response.xpath('//div[@class="-job-summary"]')
        for job in jobs:
            job = JobDetail()
            job['title'] = job.xpath('.//*[@class="s-link s-link__visited"]').extract()
            job['company'] = job.xpath('.//div[@class="fc-black-700 fs-body2 -company"]//span[1]/text()').extract()
            job['location'] = job.xpath('.//div[@class="fc-black-700 fs-body2 -company"]//span[2]/text()').extract()
            yield jobs

    def parse(self, response):
        yield Request('https://stackoverflow.com/jobs', callback=self.parse_details)

process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(JobItems)
Answer 0 (score: 0)
If the code is exactly as in your post, the JobItems class will not know how to parse the page. Indent the code properly, as shown below. Also, you are yielding jobs; it should be yield job instead.
class JobItems(Spider):
    ...
    def parse(self, response):
        ...
    def parse_details(self, response):
        ...
        for job in jobs:
            ...
            yield job

process = CrawlerProcess(...)
process.crawl(JobItems)
Edit:
The problem is that you are missing a process.start() call after process.crawl(...). Without it, the crawler never starts.
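A minimal sketch of the corrected boilerplate, assuming the spider class from above:

process = CrawlerProcess()
process.crawl(JobItems)   # schedule the spider
process.start()           # start the reactor; blocks until the crawl finishes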
A few comments:
- Use .get() instead of .extract() to get a single value; .extract() gives you a list (a quick sketch of the difference follows this list).
- You are overwriting the loop variable job with the JobDetail instance.
- Drop start_requests and use start_urls instead (Docs).
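For instance, here is a minimal sketch of the .get() / .extract() difference using parsel, the selector library that backs Scrapy's response.css and response.xpath (the HTML snippet is made up for illustration):

from parsel import Selector

sel = Selector(text='<div><span>Acme Corp</span><span>Remote</span></div>')

# .extract() returns a list of every match
print(sel.css('span::text').extract())   # ['Acme Corp', 'Remote']

# .get() returns just the first match, or None when nothing matches
print(sel.css('span::text').get())       # 'Acme Corp'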
Here's a working version using CSS selectors:
from scrapy import Item, Spider, Request
from scrapy.crawler import CrawlerProcess
import scrapy

class JobDetail(Item):
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()

class JobItems(Spider):
    name = 'JobItems'
    start_urls = [
        'https://stackoverflow.com/jobs'
    ]

    def parse(self, response):
        for j in response.css('.-job-summary'):
            job = JobDetail()
            job['title'] = j.css('.s-link__visited::text').get().strip(' -\r\n')
            job['company'] = j.css('.-company span::text').get().strip(' -\r\n')
            job['location'] = j.css('.-company span:last-of-type::text').get().strip(' -\r\n')
            yield job

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(JobItems)
    process.start()
Output:
{'company': 'Horizontal Integration',
'location': 'San Diego, CA',
'title': 'Software Engineer'}
...
{'company': 'Healthy Back Institute',
'location': 'Austin, TX',
'title': 'Senior PHP / Laravel Developer'}
...
Answer 1 (score: 0)
Try the following script to get the content you are after. Run it as is; no changes needed:
from scrapy.crawler import CrawlerProcess
import scrapy

class JobItems(scrapy.Spider):
    name = 'JobItems'
    start_urls = ['https://stackoverflow.com/jobs']

    def parse(self, response):
        for job in response.xpath('//div[@class="-job-summary"]'):
            item = {}
            item['title'] = job.xpath('.//h2/a[contains(@class,"s-link__visited")]/text()').get().strip()
            item['company'] = job.xpath('.//*[contains(@class,"-company")]/span/text()').get().strip()
            item['location'] = job.xpath('normalize-space(.//*[contains(@class,"-company")]/span[starts-with(@class,"fc-black-")]/text())').get().strip()
            yield item

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    process.crawl(JobItems)
    process.start()
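A usage note rather than part of the answer: if you want the scraped items written to a file instead of only appearing in the log, the feed-export settings available in Scrapy 1.6 can be passed to CrawlerProcess as well (the jobs.json filename is just an example):

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'FEED_FORMAT': 'json',    # serialize items as JSON
        'FEED_URI': 'jobs.json',  # write them to jobs.json in the working directory
    })
    process.crawl(JobItems)
    process.start()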