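I'm having trouble scraping a website that renders all of its pages with JS: https://www.jobteaser.com/en/job-offers
After inspecting the requests with the browser's debugger tools, I saw that all the content I want is sent via AJAX in .json format.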
[Screenshot: the file returning the content]
So I wrote the following spider to fetch the content for a specific search:
import scrapy
from scrapy import Request
import json


class JobteaserSpider(scrapy.Spider):
    name = "jobteaser"
    start_urls = ['https://www.jobteaser.com/fr/job-offers?q%3Dbusiness%20analyst%26contract%3Dstage%2Cinternship%2Cwerkstudent%26location%3DFrance..France%26locale%3Dfr%2Cen']

    def parse(self, response):
        apiKey = "..."
        header = {
            "requests": [
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=*&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=abroad_only&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=company_business_type&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=company_sectors&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=contract_duration&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=contract&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%5D&hitsPerPage=20&page=0&facets=locale&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=location&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=position_category&distinct=true&facetingAfterDistinct=true"},
                {"indexName": "job_offers",
                 "params": "query=business%20analyst&facetFilters=%5B%5B%22contract%3Astage%22%2C%22contract%3Ainternship%22%2C%22contract%3Awerkstudent%22%5D%2C%5B%22location%3AFrance%22%5D%2C%5B%22locale%3Afr%22%2C%22locale%3Aen%22%5D%5D&hitsPerPage=20&page=0&facets=start_date&distinct=true&facetingAfterDistinct=true"},
            ],
            "apiKey": apiKey
        }
        yield scrapy.Request(
            url="https://9vcp793ivh-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=9VCP793IVH",
            method='POST',
            body=json.dumps(header),
            headers={'Content-Type': 'application/json'},
            callback=self.parse_internship)

    def parse_internship(self, response):
        yield {"E": response.body}
The headers are also sent in .json format. USER_AGENT has been changed, and ROBOTSTXT_OBEY is set to False. Despite these measures, I still get this error:
DEBUG: Crawled (200) <GET https://www.jobteaser.com/fr/job-offers?q%3Dbusiness%20analyst%26contract%3Dstage%2Cinternship%2Cwerkstudent%26location%3DFrance..France%26locale%3Dfr%2Cen> (referer: None)
DEBUG: Crawled (400) <POST https://9vcp793ivh-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=9VCP793IVH> (referer: https://www.jobteaser.com/)
INFO: Ignoring response <400 https://9vcp793ivh-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=9VCP793IVH>: HTTP status code is not handled or not allowed
INFO: Closing spider (finished)
Perhaps the URL sent with the request is wrong, but after a thorough analysis of the original URL, I could not find the correct one.
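One way to rule the URL in or out is to split it into its components and look at the decoded query parameters. The following is only a minimal sketch using the standard library; the helper name `describe_url` is hypothetical and not part of the spider.

```python
from urllib.parse import parse_qs, urlsplit

# The POST URL used in the spider, split over several lines for readability.
ALGOLIA_URL = (
    "https://9vcp793ivh-dsn.algolia.net/1/indexes/*/queries"
    "?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0"
    "&x-algolia-application-id=9VCP793IVH"
)


def describe_url(url):
    """Return the host, path and decoded query parameters of a URL."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,
        "path": parts.path,
        # parse_qs percent-decodes values and maps each name to a list.
        "query": parse_qs(parts.query),
    }


info = describe_url(ALGOLIA_URL)
for name, values in info["query"].items():
    print(name, "=", values)
```

The query string here carries only the agent and the application id. The key travels in the JSON body instead, so if the URL matches what the browser sends, the body is the next thing to compare.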
Thanks!
Answer 0 (score: 0)
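Well, this was much easier than I thought: I only needed to retrieve the apiKey from another page that displays the json data. Once the correct apiKey is set, the endpoint returns the desired content.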
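For illustration, the request body could be built from its parts rather than pasted as pre-encoded strings, which makes it easier to swap in the right key. This is a sketch under assumptions, not the exact code I ran: `build_params`, `build_body` and `extract_hits` are hypothetical helper names, the key is a placeholder, and the `results`/`hits` layout of the response is assumed from the JSON seen in the browser.

```python
import json
from urllib.parse import quote, urlencode


def build_params(query, facet_filters, page=0, hits_per_page=20):
    """Encode one search as the URL-encoded `params` string."""
    return urlencode(
        {
            "query": query,
            # Compact JSON, so the encoded form matches what the site sends.
            "facetFilters": json.dumps(facet_filters, separators=(",", ":")),
            "hitsPerPage": hits_per_page,
            "page": page,
        },
        quote_via=quote,  # spaces become %20 rather than +
    )


def build_body(api_key, query, facet_filters, page=0):
    """Build the JSON body for the POST, with the apiKey alongside the requests."""
    request = {
        "indexName": "job_offers",
        "params": build_params(query, facet_filters, page),
    }
    return json.dumps({"requests": [request], "apiKey": api_key})


def extract_hits(response_text):
    """Collect the hits of every result (assumed response layout)."""
    data = json.loads(response_text)
    return [
        hit
        for result in data.get("results", [])
        for hit in result.get("hits", [])
    ]


FILTERS = [
    ["contract:stage", "contract:internship", "contract:werkstudent"],
    ["location:France"],
    ["locale:fr", "locale:en"],
]
# Placeholder: the real key has to be copied from the page that exposes it.
body = build_body("<apiKey from the page>", "business analyst", FILTERS)
```

In the spider, `body` would go into the `body=` argument of the POST `scrapy.Request`, and the callback could yield each item from `extract_hits(response.text)` instead of the raw response body.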