Selecting results one by one in Scrapy

Date: 2017-12-30 05:37:15

Tags: python web-scraping scrapy

I downloaded the source of one results page from Indeed and I'm trying to get all the job titles from it. For that I'm using this XPath:

response.xpath('//*[@class="  row  result"]//*[@class="jobtitle"]//text()').extract()

The problem is that each title does not come back as a single string, so I get this result:

[u'\n    ',
 u'Data',
 u' ',
 u'Scientist',
 u' Experto SQL con conocimiento en R',
 u'\n    ',
 u'\n    ',
 u'Data',
 u' Analytic con Python',
 u'\n    ',
 u'\n    ',
 u'Data',
 u' Analytic con R',

This makes it hard to map the titles to the other data. What I want is to select and process the jobs one by one, something like extract_first():

response.xpath('//*[@class="  row  result"]').extract_first()

but for any given index, and with the option to keep processing the selected data. I tried this:

current_job = response.xpath('//*[@class="  row  result"]').extract_first()
current_job = TextResponse(url='',body=current_job,encoding='utf-8') 

But it only works for the first result, and it doesn't look like a Pythonic approach to me.

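2 answers:

Answer 0 (score: 2)

First, I would select only the a elements (without text() or extract()). Then I would loop over them with for, apply text() and extract() to each a individually, and use join() to concatenate the fragments into a single title string.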

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://www.indeed.cl/trabajo?q=Data%20scientist&l=']

    def parse(self, response):
        print('url:', response.url)

        results = response.xpath('//h2[@class="jobtitle"]/a')
        print('number:', len(results))

        for item in results:
            title = ''.join(item.xpath('.//text()').extract())
            print('title:', title)

# --- runs without a Scrapy project; titles are printed, no file is written ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(MySpider)
c.start()

Result:

number: 10
title: Data Scientist
title: CONSULTOR DATA SCIENCE SANTIAGO DE CHILE
title: Líder Análisis de Datos MCoE Minerals Americas
title: Ingeniero Inteligencia Mercado, BI
title: Ingeniero Inteligencia de Mercado, Business Intelligence
title: Data Scientist
title: Data Scientist
title: Data Scientist (Machine Learning)
title: Data Scientist / Ml Scientist
title: Young Professional - Spanish LatAm

Answer 1 (score: 1)

Give this a try. You will need to adapt my script slightly to fit your project. It should solve the problem you described above.

import requests
from scrapy import Selector

res = requests.get("https://www.indeed.cl/trabajo?q=Data%20scientist")
sel = Selector(text=res.text)  # pass the HTML explicitly; res is a requests response, not a Scrapy one
for item in sel.css("h2.jobtitle a"):
    title = ' '.join(item.css("::text").extract())
    print(title)

Output:

Data   Scientist
CONSULTOR  DATA  SCIENCE SANTIAGO DE CHILE
Líder Análisis de Datos MCoE Minerals Americas
Ingeniero Inteligencia Mercado, BI
Ingeniero Inteligencia de Mercado, Business Intelligence
Data   Scientist
Data   Scientist
Young Professional - Spanish LatAm
Data   Scientist  (Machine Learning)
Data   Scientist  / Ml  Scientist
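
The repeated spaces in the output above appear because the text fragments already contain whitespace-only nodes, and joining them with ' ' adds more. One possible cleanup, shown here as a minimal sketch (the `join_title` helper is a hypothetical name, and the fragments are the first title from the question), is to concatenate the fragments without a separator and then collapse every whitespace run with split():

```python
def join_title(fragments):
    """Concatenate text fragments and collapse whitespace runs to single spaces."""
    # "".join keeps the original spacing; split() with no argument drops
    # leading/trailing whitespace and splits on any run of whitespace.
    return " ".join("".join(fragments).split())


# Fragments of the first title, as returned by //text() in the question.
fragments = [
    "\n    ",
    "Data",
    " ",
    "Scientist",
    " Experto SQL con conocimiento en R",
    "\n    ",
]

title = join_title(fragments)
print(title)
```

The same helper could replace the ' '.join(...) call inside the loop, for example title = join_title(item.css("::text").extract()).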