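I am trying to scrape the table on this web page (https://www.ftse.com/products/indices/uk). When I inspect the page in the Network tab, I see that it fetches its data from an API through an AJAX request (type POST), which the browser makes after the layout has loaded. So I am trying to build a spider that sends a POST request to that endpoint using the form_data from the request. I ran a quick test with the following shell command and got the data back.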
curl 'https://www.ftse.com/products/indices/home/ra_getIndexData/' --data 'indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions&Indices=ASX%2CFTSE+All-Share%2C%3AUKX%2CFTSE+100%2C%3AMCX%2CFTSE+250%2C%3AMCXNUK%2CFTSE+250+Net+Tax%2C%3ANMX%2CFTSE+350%2C%3ASMX%2CFTSE+Small+Cap%2C%3ANSX%2CFTSE+Fledgling%2C%3AAS0%2CFTSE+All-Small%2C%3AASXX%2CFTSE+All-Share+ex+Invt+Trust%2C%3AUKXXIT%2CFTSE+100+Index+ex+Invt+Trust%2C%3AMCIX%2CFTSE+250+Index+ex+Invt+Trust%2C%3ANMIX%2CFTSE+350+Index+ex+Invt+Trust%2C%3ASMXX%2CFTSE+Small+Cap+ex+Invt+Trust%2C%3AAS0X%2CFTSE+All-Small+ex+Invt+Trust%2C%3AUKXDUK%2CFTSE+100+Total+Return+Declared+Dividend%2C%3A&type='
However, when I try to code the same request in a spider using the FormRequest class, the spider fails.
import urllib.parse

import scrapy


class FtseSpider(scrapy.Spider):
    name = 'ftse'
    #allowed_domains = ['www.ftserussell.com', 'www.ftse.com']
    start_urls = [
        'https://www.ftse.com/products/indices/uk']

    def parse(self, response):
        # URL parameters for the request
        data = 'indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions&Indices=ASX%2CFTSE+All-Share%2C%3AUKX%2CFTSE+100%2C%3AMCX%2CFTSE+250%2C%3AMCXNUK%2CFTSE+250+Net+Tax%2C%3ANMX%2CFTSE+350%2C%3ASMX%2CFTSE+Small+Cap%2C%3ANSX%2CFTSE+Fledgling%2C%3AAS0%2CFTSE+All-Small%2C%3AASXX%2CFTSE+All-Share+ex+Invt+Trust%2C%3AUKXXIT%2CFTSE+100+Index+ex+Invt+Trust%2C%3AMCIX%2CFTSE+250+Index+ex+Invt+Trust%2C%3ANMIX%2CFTSE+350+Index+ex+Invt+Trust%2C%3ASMXX%2CFTSE+Small+Cap+ex+Invt+Trust%2C%3AAS0X%2CFTSE+All-Small+ex+Invt+Trust%2C%3AUKXDUK%2CFTSE+100+Total+Return+Declared+Dividend%2C%3A&type='
        # convert the URL parameters into a dict
        params_raw_ = urllib.parse.parse_qs(data)
        prams_dict_ = {k: v[0] for k, v in params_raw_.items()}
        # issue the POST request
        yield [scrapy.FormRequest('https://www.ftse.com/products/indices/home/ra_getIndexData/',
                                  method='POST',
                                  body=prams_dict_)]
Answer 0 (score: 1)
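Because the data contains a nested dictionary, it cannot be expressed as formdata in Scrapy, so the JSON dump has to be passed in the request body; this is equivalent to the original representation of `data`. Also, use `yield from` when yielding from an iterable, or yield a single object or Request instead of a list.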
yield from [scrapy.FormRequest('https://www.ftse.com/products/indices/home/ra_getIndexData/',
                               method='POST', body=data)]
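As a side check on the shape of the payload, here is a minimal sketch that uses only the standard library to decode the form body into a dict and re-encode it. The `data` string is a shortened copy of the one in the question (the `Indices` list is truncated purely for readability), and the names `form`, `is_flat` and `round_trip` are illustrative only. For this shortened payload, at least, the decoded dict comes out flat. Note also that `parse_qs` and `parse_qsl` drop the empty `type` field unless `keep_blank_values=True` is passed.

```python
from urllib.parse import parse_qsl, urlencode

# Shortened version of the payload from the question (assumption: the
# Indices list is truncated here only to keep the example readable).
data = (
    "indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions"
    "&Indices=ASX%2CFTSE+All-Share%2C%3AUKX%2CFTSE+100%2C%3A&type="
)

# keep_blank_values=True preserves the empty "type" field, which
# parse_qs/parse_qsl would otherwise silently drop.
form = dict(parse_qsl(data, keep_blank_values=True))

# Every decoded value is a plain string, so this dict is flat.
is_flat = all(isinstance(value, str) for value in form.values())

# Re-encoding the dict reproduces the original body string.
round_trip = urlencode(form)

print(form["Indices"])
print(is_flat, round_trip == data)
```

If the full payload decodes the same way, the dict could in principle be passed as `scrapy.FormRequest(url, formdata=form)`, which url-encodes the dict and issues a POST. Whether the endpoint accepts that is untested here, so passing the raw string as `body=data` remains the approach the answer describes.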