I am trying to extract the title from this page, but I get a different title, one that does not belong to the response URL. Here is what I am trying:
import scrapy

class ElementSpider(scrapy.Spider):
    name = 'qwerty4'
    allowed_domains = ["burbank.com.au"]
    start_urls = [
        "https://www.burbank.com.au/victoria/home-details/alphington-153-179727",
        "https://www.burbank.com.au/victoria/home-details/sandringham-151-171569",
        "https://www.burbank.com.au/victoria/home-details/sandringham-151-181680",
        "https://www.burbank.com.au/victoria/home-details/bellfield-184-171585",
        "https://www.burbank.com.au/victoria/home-details/carlton-178-172662",
        "https://www.burbank.com.au/victoria/home-details/carlton-178-178079",
    ]

    def parse(self, response):
        # Take the first text node of the house-name span on the detail page.
        title = response.xpath('//div[@class="col-md-4 col-xs-12 col-sm-12"]/div[@class="housename"]/span/text()').extract()[0]
        print(response.url)
        print(title)
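Please suggest how to resolve this.
Answer 0 (score: 1)
They do not want their site to be scraped, so they have added a technique to confuse scrapers.
Change some fields in settings.py.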
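Answer 1 (score: 0)
It seems the site stores a viewstate.
To work around this, you need to get rid of Scrapy's concurrency by setting CONCURRENT_REQUESTS = 1, so that requests are sent one at a time and each response matches the page that was just requested.
Otherwise you will need to look further into how the viewstate is generated. It may be bound to the IP address, which could mean you need some proxies to get around it.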
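As a minimal sketch of the concurrency change described in this answer, the fragment below could go in the project's settings.py. CONCURRENT_REQUESTS is the setting named in the answer; CONCURRENT_REQUESTS_PER_DOMAIN is an extra Scrapy setting added here as an assumption, since all the start URLs are on one domain. Whether this is enough for this particular site is not confirmed.

```python
# settings.py (fragment)

# Send only one request at a time, as suggested in the answer, so that
# server-side state is not overwritten by a parallel request.
CONCURRENT_REQUESTS = 1

# Assumption, not from the answer: also cap the per-domain limit at 1.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

The same values can alternatively be set for a single spider through its custom_settings class attribute, which keeps the rest of the project at the default concurrency.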