I want to scrape all the jobs posted on the site https://www.germanystartupjobs.com using Scrapy. Since the job listings are loaded via a POST request, I set start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']. I found this URL in the Chrome dev tools Network tab while on page 1 of the site, where the request shows method: POST.

I assumed I would get a different URL for the second page, but that doesn't seem to be the case. I also tried

start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

to generate more pages with an index appended, but it didn't help. The current version of my code is here:
import scrapy
import json
import re
import textwrap


class GermanyStartupJobs(scrapy.Spider):

    name = 'gsjobs'
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

    def parse(self, response):
        # The endpoint returns JSON; the listing markup is in the 'html' key
        data = json.loads(response.body)
        html = data['html']
        selector = scrapy.Selector(text=html, type="html")
        hrefs = selector.xpath('//a/@href').extract()
        print("LENGTH =", len(hrefs))

        for href in hrefs:
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        try:
            full_d = str(response.xpath(
                '//div[@class="col-sm-5 justify-text"]//*/text()').extract())

            full_des_li = full_d.split(',')
            full_des_lis = []

            for f in full_des_li:
                ff = "".join((f.strip().replace('\n', '')).split())
                if len(ff) < 3:
                    continue
                full_des_lis.append(f)

            full = 'u' + str(full_des_lis)
            length = len(full)
            # textwrap.wrap() needs an integer width (use // in Python 3)
            full_des_list = textwrap.wrap(full, length // 3)[:-1]
            full_des_list.reverse()

            # get the job title
            try:
                title = response.css('.job-title').xpath('./text()').extract_first().strip()
            except Exception:
                print("No title")
                title = ''

            # get the company name
            try:
                company_name = response.css('.company-title').xpath('./normal/text()').extract_first().strip()
            except Exception:
                print("No company name")
                company_name = ''

            # get the company location
            try:
                company_location = response.xpath('//a[@class="google_map_link"]/text()').extract_first().strip()
            except Exception:
                print('No company location')
                company_location = ''

            # get the job poster email (if available)
            email = ''
            try:
                pattern = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
                for text in full_des_list:
                    email = pattern.findall(text)[-1]
                    if email is not None:
                        break
            except Exception:
                print('No email')
                email = ''

            # get the job poster phone number (if available)
            phone = ''
            try:
                r = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
                phone = r.findall(full_des_list[0])[-1]
                if phone is not None:
                    phone = '+49-' + phone
            except Exception:
                print('no phone')
                phone = ''

            yield {
                'title': title,
                'company name': company_name,
                'company_location': company_location,
                'email': email,
                'phone': phone,
                'source': u"Germany Startup Job"
            }
        except Exception:
            print('Not valid')
            # raise Exception("Think better!!")
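The email and phone patterns in the spider are easiest to debug outside Scrapy. Here is a standalone sketch of that extraction step; the helper name `extract_contact` and the simplified email pattern are mine, not from the original code:

```python
import re

# Simplified variants of the spider's email/phone regexes, factored
# into a helper so the patterns can be exercised without crawling
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.I)
PHONE_RE = re.compile(r"\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}")

def extract_contact(text):
    """Return (email, phone) found in text, or empty strings."""
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    email = emails[0] if emails else ''
    phone = '+49-' + phones[0] if phones else ''
    return email, phone
```

Using explicit `if matches else ''` checks avoids the bare `except` blocks in the spider, which currently swallow the `IndexError` raised when `findall()` comes back empty.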
I want to get similar information from at least the first 17 pages of the site. How can I achieve this and improve my code? Once I have the information I need, I plan to use multi-threading to speed up the process, and nltk to search for the poster's name (if available).
Answer 0 (score: -1)
To scrape the site, you have to actually figure out how the data flows between the client and the server by inspecting the traffic. The pages of data you want may not be expressible in the URL alone.

Have you analyzed the network connections the site makes while you browse it? It may pull content from URLs that you can access directly, letting you retrieve the data in a machine-readable way. That would be much easier than scraping the rendered pages.