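I am scraping a page with Scrapy, and I can extract all of the visible text without trouble. However, some of the text is not visible to the crawler and ends up as blank space.
For example, viewing the page source lets me see these fields: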
https://www.dropbox.com/s/f056mffmuah6uu4/Screenshot%202015-07-23%2018.23.32.png?dl=0
I have tried many times to reach these fields with XPath and CSS selectors, and every attempt has failed to return them.
When I try something like:
response.xpath('//text()').extract()
I cannot see these fields anywhere in the text dump.
Does anyone know why these fields are not visible to Scrapy? The site is: https://www.buzzbuzzhome.com/uc/units/houses/sapphire
Answer 0 (score: 1)
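The price history is not part of the initial HTML. The browser loads it afterwards with a separate request, and Scrapy does not execute JavaScript, so it never sees those fields. In your spider, you need to make an additional XHR POST request to the https://www.buzzbuzzhome.com/bbhAjax/Development/UnitPriceHistory
endpoint, supplying the necessary headers and POST parameters, to get the price history: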
import json

import scrapy


class BuzzSpider(scrapy.Spider):
    name = 'buzzbuzzhome'
    allowed_domains = ['buzzbuzzhome.com']
    start_urls = ['https://www.buzzbuzzhome.com/uc/units/houses/sapphire']

    def parse(self, response):
        # Both IDs are read from the initial HTML of the unit page.
        unit_id = response.xpath("//div[@id = 'unitDetails']/@data-unit-id").extract()[0]
        development_url = "uc"
        new_relic_id = response.xpath("//script[contains(., 'xpid')]").re(r'xpid:"(.*?)"')

        params = {"developmentUrl": development_url, "unitID": unit_id}
        # Replay the XHR POST that loads the price history, with a JSON body.
        yield scrapy.Request(
            "https://www.buzzbuzzhome.com/bbhAjax/Development/UnitPriceHistory",
            method="POST",
            body=json.dumps(params),
            callback=self.parse_history,
            headers={
                "Accept": "*/*",
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36",
                "X-Requested-With": "XMLHttpRequest",
                "X-NewRelic-ID": new_relic_id,
                "Origin": "https://www.buzzbuzzhome.com",
                "Host": "www.buzzbuzzhome.com",
                "Content-Type": "application/json; charset=UTF-8",
            })

    def parse_history(self, response):
        # Each div.row in the response holds one price-history entry.
        for row in response.css("div.row"):
            title = row.xpath(".//div[@class='content-title']/text()").extract()[0].strip()
            text = row.xpath(".//div[@class='content-text']/text()").extract()[0].strip()
            print(title, text)
This prints:
05/25/2015 Unit listed as Sold
12/18/2014 Unit listed as For Sale
11/24/2014 Unit price increased by 1.54% to $461,990
11/04/2014 Unit price increased by 6.81% to $454,990
10/02/2014 Unit price increased by 4.67% to $425,990
01/22/2014 Unit price increased by 2.52% to $406,990
12/06/2013 Unit listed as For Sale at $396,990
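If you want structured records rather than printed lines, each row can be split into a date, an event description, and an optional price. The following is only a minimal sketch that works on lines shaped like the output above. The helper name `parse_history_line` and the regular expression are my own assumptions, not part of the spider or of the site's API, and the MM/DD/YYYY date layout is inferred from the sample output.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of the original spider.
# Matches a dollar amount such as "$461,990" (an assumed price format).
PRICE_RE = re.compile(r"\$([\d,]+)")


def parse_history_line(line):
    """Turn one 'MM/DD/YYYY description' line into a dictionary."""
    # Everything before the first space is the date, the rest is the event.
    date_text, _, description = line.strip().partition(" ")
    match = PRICE_RE.search(description)
    return {
        "date": datetime.strptime(date_text, "%m/%d/%Y").date(),
        "event": description,
        # Some rows (for example "Unit listed as Sold") carry no price.
        "price": int(match.group(1).replace(",", "")) if match else None,
    }


if __name__ == "__main__":
    sample = [
        "05/25/2015 Unit listed as Sold",
        "11/24/2014 Unit price increased by 1.54% to $461,990",
    ]
    for line in sample:
        print(parse_history_line(line))
```

Inside `parse_history` you could build the same dictionary directly from `title` and `text` and `yield` it instead of printing, so that Scrapy's feed exports can write the records to a file.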