Scrapy: scraping data from a JavaScript-generated form

Time: 2015-07-07 04:28:32

Tags: javascript python scrapy

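I am referring to the following question on Stack Overflow: Scrapy, scrapping data inside a javascript

I am trying to reproduce the answer @Rho gave to that question, to learn how to scrape data from a JavaScript-generated form. The form's payload seems to have changed since the question was posted, so I modified it accordingly.

My code and output are as follows: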

$ scrapy shell https://www.mcdonalds.com.sg/locate-us/

2015-07-07 12:09:28+0800 [scrapy] INFO: Scrapy 0.24.6 started (bot: scrapybot)
.....
2015-07-07 12:09:28+0800 [default] INFO: Spider opened
2015-07-07 12:09:32+0800 [default] DEBUG: Crawled (200) <GET https://www.mcdonalds.com.sg/locate-us/> (referer: None)
....
>>> url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
>>> payload = {'action':'store_locator_locations'}
>>> head = {'X-Requested-With':'XMLHttpRequest'}
>>> from scrapy.http import FormRequest
>>> req=FormRequest(url,formdata=payload,headers=head)
>>> fetch(req)
2015-07-07 12:12:24+0800 [default] DEBUG: Crawled (404) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php> (referer: None)

I expected a 200 response, but as you can see above, I get a 404 status code.

1 Answer:

Answer 0: (score: 0)

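The code itself is not the problem. The original question and answer you refer to date from 2013, which is a lifetime ago on the internet.

Things have changed at McDonald's Singapore, and apparently in its WordPress setup as well, though not by much.

What used to be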

url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'

is now

url = 'https://www.mcdonalds.com.sg/wp/wp-admin/admin-ajax.php'

(I found this by opening the Chrome developer tools with F12 and looking at the Network tab.)
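To check the request from the question against the new path without starting Scrapy, the same POST can be built with the standard library. This is only a sketch: it uses urllib instead of Scrapy's FormRequest, the helper name build_post_request is made up for illustration, and whether the endpoint still accepts POST is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# New path reported above (note the extra "/wp" segment).
NEW_URL = "https://www.mcdonalds.com.sg/wp/wp-admin/admin-ajax.php"


def build_post_request(url=NEW_URL):
    """Build (but do not send) the form POST used in the question."""
    # Same payload and header as in the question's Scrapy shell session.
    payload = urlencode({"action": "store_locator_locations"}).encode("ascii")
    headers = {"X-Requested-With": "XMLHttpRequest"}
    return Request(url, data=payload, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_post_request()
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would actually send it; in the Scrapy shell, the equivalent is to keep the original FormRequest call and change only the url value.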

In fact, you can simply make a GET request to this URL and get JSON back:

  

GET https://www.mcdonalds.com.sg/wp/wp-admin/admin-ajax.php?action=store_locator_locations

[{
    "id": "417",
    "name": "McDonald\u2019s JCube",
    "address": "2 Jurong East Central 1<br\/>#01-09<br\/>JCube\r\n",
    "city": "Singapore",
    "lat": "1.33352",
    "long": "103.740277",
    "op_hours": "Mon-Fri: Opens at 0630<br>\r\nSat-Sun: Opens at 0700<br>\r\nSun-Thur: Closes at 2300 <br>\r\nFri\/Sat & PH Eve: Closes at 0000\r\n<br><br>\r\nDessert Kiosk: Daily 1100 - 2300",
    "phone": "66844228",
    "region": "west",
    "types": ["3"],
    "zip": "609731"
},
...
]
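Once the JSON is available, turning it into clean records is straightforward. The following is a minimal sketch using only the standard library; the function names (build_locations_url, clean_text, parse_locations, fetch_locations) are hypothetical, the field names come from the sample response above, and the endpoint may well have changed again since this answer was written.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameter taken from the answer above.
AJAX_URL = "https://www.mcdonalds.com.sg/wp/wp-admin/admin-ajax.php"


def build_locations_url():
    """Return the GET URL for the store locator endpoint."""
    return AJAX_URL + "?" + urlencode({"action": "store_locator_locations"})


def clean_text(value):
    """Turn <br>-separated HTML fragments into one comma-separated line."""
    text = re.sub(r"<br\s*/?>", "\n", value)
    parts = [part.strip() for part in text.splitlines()]
    return ", ".join(part for part in parts if part)


def parse_locations(raw_json):
    """Parse the JSON array of stores into a list of simplified dicts."""
    stores = []
    for item in json.loads(raw_json):
        stores.append({
            "name": item["name"],
            "address": clean_text(item["address"]),
            "lat": float(item["lat"]),
            "long": float(item["long"]),
            "zip": item["zip"],
        })
    return stores


def fetch_locations():
    """Download and parse the store list (needs network access)."""
    with urlopen(build_locations_url(), timeout=30) as response:
        return parse_locations(response.read().decode("utf-8"))
```

Inside a Scrapy spider, the same parse_locations function could be called from the callback with the decoded response body instead of going through fetch_locations.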