I want to scrape all the jobs posted on the site https://www.germanystartupjobs.com using Scrapy. Since the job listings are loaded via a POST request, I set start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']. I found this URL in the Chrome dev tools Network tab while on page 1 of the site, where the request shows method: POST.

I assumed I would get a different URL for the second page, but that doesn't seem to be the case. I also tried

start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

to generate more pages with an index appended, but it didn't help. The current version of my code is here:
import scrapy
import json
import re
import textwrap


class GermanyStartupJobs(scrapy.Spider):

    name = 'gsjobs'
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

    def parse(self, response):
        # The endpoint returns JSON; the listing markup is in the 'html' key
        data = json.loads(response.body)
        html = data['html']
        selector = scrapy.Selector(text=html, type="html")
        hrefs = selector.xpath('//a/@href').extract()
        print("LENGTH =", len(hrefs))

        for href in hrefs:
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        try:
            full_d = str(response.xpath(
                '//div[@class="col-sm-5 justify-text"]//*/text()').extract())

            full_des_li = full_d.split(',')
            full_des_lis = []

            for f in full_des_li:
                ff = "".join((f.strip().replace('\n', '')).split())
                if len(ff) < 3:
                    continue
                full_des_lis.append(f)

            full = 'u' + str(full_des_lis)
            length = len(full)
            # textwrap.wrap() needs an integer width (use // in Python 3)
            full_des_list = textwrap.wrap(full, length // 3)[:-1]
            full_des_list.reverse()

            # get the job title
            try:
                title = response.css('.job-title').xpath('./text()').extract_first().strip()
            except Exception:
                print("No title")
                title = ''

            # get the company name
            try:
                company_name = response.css('.company-title').xpath('./normal/text()').extract_first().strip()
            except Exception:
                print("No company name")
                company_name = ''

            # get the company location
            try:
                company_location = response.xpath('//a[@class="google_map_link"]/text()').extract_first().strip()
            except Exception:
                print('No company location')
                company_location = ''

            # get the job poster email (if available)
            email = ''
            try:
                pattern = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
                for text in full_des_list:
                    email = pattern.findall(text)[-1]
                    if email is not None:
                        break
            except Exception:
                print('No email')
                email = ''

            # get the job poster phone number (if available)
            phone = ''
            try:
                r = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
                phone = r.findall(full_des_list[0])[-1]
                if phone is not None:
                    phone = '+49-' + phone
            except Exception:
                print('no phone')
                phone = ''

            yield {
                'title': title,
                'company name': company_name,
                'company_location': company_location,
                'email': email,
                'phone': phone,
                'source': u"Germany Startup Job"
            }
        except Exception:
            print('Not valid')
            # raise Exception("Think better!!")
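The email and phone patterns in the spider are easiest to debug outside Scrapy. Here is a standalone sketch of that extraction step; the helper name `extract_contact` and the simplified email pattern are mine, not from the original code:

```python
import re

# Simplified variants of the spider's email/phone regexes, factored
# into a helper so the patterns can be exercised without crawling
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.I)
PHONE_RE = re.compile(r"\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}")

def extract_contact(text):
    """Return (email, phone) found in text, or empty strings."""
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    email = emails[0] if emails else ''
    phone = '+49-' + phones[0] if phones else ''
    return email, phone
```

Using explicit `if matches else ''` checks avoids the bare `except` blocks in the spider, which currently swallow the `IndexError` raised when `findall()` comes back empty.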
I want to get similar information from at least the first 17 pages of the site. How can I achieve this and improve my code? Once I have the information I need, I plan to use multi-threading to speed up the process, and nltk to search for the poster's name (if available).
Answer 0 (score: -1)
To scrape the site, you have to actually figure out how the data flows between the client and the server by inspecting the traffic. The pages of data you want may not be expressible in the URL alone.

Have you analyzed the network connections the site makes while you browse it? It may pull content from URLs that you can access directly, letting you retrieve the data in a machine-readable way. That would be much easier than scraping the rendered pages.