Scrapy gets the wrong value for the requested URL

Time: 2017-02-26 15:16:09

Tags: python web-scraping scrapy

I am trying to extract the title from this page, but I get a different title, one that does not belong to the response URL. This is what I am trying:

import scrapy

class ElementSpider(scrapy.Spider):
    name = 'qwerty4'
    allowed_domains = ["burbank.com.au"]
    start_urls = ["https://www.burbank.com.au/victoria/home-details/alphington-153-179727", "https://www.burbank.com.au/victoria/home-details/sandringham-151-171569", "https://www.burbank.com.au/victoria/home-details/sandringham-151-181680", "https://www.burbank.com.au/victoria/home-details/bellfield-184-171585", "https://www.burbank.com.au/victoria/home-details/carlton-178-172662", "https://www.burbank.com.au/victoria/home-details/carlton-178-178079" ]

    def parse(self, response):
        # Take the first text node of the house-name span as the title.
        title = response.xpath('//div[@class="col-md-4 col-xs-12 col-sm-12"]/div[@class="housename"]/span/text()').extract()[0]
        print(response.url)
        print(title)

For some of the requests this returns the wrong data. The output is: [image]

Please suggest how to fix this.

2 answers:

Answer 0: (score: 1)

They do not want their site to be scraped, so they have added a technique to confuse scrapers.

Change some fields in settings.py.

Answer 1: (score: 0)

It seems the website stores view state.

To work around this, you need to get rid of Scrapy's concurrency by setting CONCURRENT_REQUESTS = 1.
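A minimal sketch of that change, assuming a standard Scrapy project where settings.py is the file named by the project's settings module. Only CONCURRENT_REQUESTS comes from this answer. Whether serializing the requests is enough for this particular site has not been verified here.

```python
# settings.py (fragment)

# Let Scrapy have only one request in flight at a time, so each
# response is received before the next page is requested.
CONCURRENT_REQUESTS = 1
```

The same key can also be placed in a spider's custom_settings dictionary if the limit should apply to a single spider rather than to the whole project.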

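Otherwise you will need to investigate further how the view state is generated. It may be bound to the IP address, which could mean you need some proxies to get around it.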
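If proxies do turn out to be necessary, one possible arrangement is to rotate through a small pool and attach a proxy to each request through the `proxy` key of Request.meta, which Scrapy's built-in HttpProxyMiddleware reads. The sketch below is an illustration only: the proxy addresses and the helper name next_proxy_meta are hypothetical, and it is an assumption that the site's view state is tied to the client IP at all.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with proxies you actually control.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

# Endless round-robin iterator over the proxy list.
_proxy_pool = cycle(PROXIES)


def next_proxy_meta():
    """Return a Request.meta dict that routes one request through the next proxy."""
    return {"proxy": next(_proxy_pool)}


# Inside a spider this would be used roughly as:
#     yield scrapy.Request(url, callback=self.parse, meta=next_proxy_meta())
```

Each call hands out the next proxy in order and wraps around at the end of the list, so consecutive requests leave from different addresses.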