I am running the following code in the Scrapy shell, using a POST request to scrape data:
url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'
data = {'action': 'wpp_property_overview_pagination',
'wpp_ajax_query[show_children]': 'true',
'wpp_ajax_query[disable_wrapper]': 'true',
'wpp_ajax_query[pagination]': 'off',
'wpp_ajax_query[per_page]': '10',
'wpp_ajax_query[query][property_category]': 'residential',
'wpp_ajax_query[query][listing_type]': 'rent',
'wpp_ajax_query[query][sort_by]': 'price_rent',
'wpp_ajax_query[query][sort_order]': 'ASC',
'wpp_ajax_query[query][pagi]': '0--10',
'wpp_ajax_query[sorter]': '',
'wpp_ajax_query[sort_by]': 'price_rent',
'wpp_ajax_query[sort_order]': 'ASC',
'wpp_ajax_query[template]': 'ajax',
'wpp_ajax_query[requested_page]': '2'}
from scrapy import FormRequest
request = FormRequest(url, formdata=data)
fetch(request)
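I know the response contains elements with the class "property-thumb": I inspected it with Chrome developer tools and read the response content. So I tried the XPath //*[@class="property-thumb"] to scrape the data. The XPath is correct (I checked it with a Chrome plugin against the content loaded into the page), but it matches nothing when I use it from the Scrapy shell: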
In [10]: response.xpath('//*[@class="property-thumb"]')
Out[10]: []
I noticed that response.body comes with a lot of backslashes, and I found that the XPath that actually matches is //*[@class=\'\\"property-thumb\\"\']:
In [11]: response.xpath('//*[@class=\'\\"property-thumb\\"\']')
Out[11]:
[<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n '>,
<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n '>,
<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n '>]
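I think there is a problem with the way Scrapy handles the response string. I also expect these backslashes to cause more problems later when scraping. Why does this happen, and how can I fix it so that the normal XPath works?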
Answer 0 (score: 1)
There is a very simple answer: you are getting JSON back, not HTML:
url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'
data = {'action': 'wpp_property_overview_pagination',
'wpp_ajax_query[show_children]': 'true',
'wpp_ajax_query[disable_wrapper]': 'true',
'wpp_ajax_query[pagination]': 'off',
'wpp_ajax_query[per_page]': '10',
'wpp_ajax_query[query][property_category]': 'residential',
'wpp_ajax_query[query][listing_type]': 'rent',
'wpp_ajax_query[query][sort_by]': 'price_rent',
'wpp_ajax_query[query][sort_order]': 'ASC',
'wpp_ajax_query[query][pagi]': '0--10',
'wpp_ajax_query[sorter]': '',
'wpp_ajax_query[sort_by]': 'price_rent',
'wpp_ajax_query[sort_order]': 'ASC',
'wpp_ajax_query[template]': 'ajax',
'wpp_ajax_query[requested_page]': '2'}
import requests
print(requests.post(url, data).json())
Which gives you:
{u'display': u' <section class="property-card new-post">\n <div class="property-thumb">\n <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" title="Lisson Street, Marylebone, London">\n <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_4427_6_large.jpg" alt="Lisson Street, Marylebone, London thumbnail">\n\n </a>\n </div><!-- /.property-thumb -->\n\n <div class="property-content">\n <header class="property-title">\n <h2>\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/">Lisson Street, Marylebone, London</a>\n </h2>\n </header>\n \n <span class="property-style-tenure"></span>\n <div class="property-details">\n\n \n <div class="property-price">\n <div class="property-style-tenure"><span></span></div>\xa3420<small>/pw</small>\n <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n </div>\n \n \n <div class="property-features">\n <div class="property-feature">\n <div class="property-living_rooms">\n <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n 1 Reception </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bedrooms">\n <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n 1 Bedroom </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bathrooms">\n <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n 1 Bathroom </div>\n </div>\n </div><!-- /.property-features -->\n\n\n <div class="property-media">\n <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_4427_1_large-743x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n \n <span class="separator">|</span>\n <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/4427/MED_4427_6235.pdf" target="_blank" 
class="alternative-link">Download Brochure</a>\n </div><!-- /.property-media -->\n\n <div class="property-read-more">\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" class="btn btn-sm lighter-dark-primary-color">\n View Details\n </a>\n </div>\n </div><!-- /.property-details -->\n </div><!-- /.property-content -->\n </section>\n <section class="property-card new-post">\n <div class="property-thumb">\n <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" title="Riding House Street, Fitzrovia, London">\n <img src="http://www.ldg.co.uk/wp-content/uploads/2016/09/IMG_3453_10_large.jpg" alt="Riding House Street, Fitzrovia, London thumbnail">\n\n </a>\n </div><!-- /.property-thumb -->\n\n <div class="property-content">\n <header class="property-title">\n <h2>\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/">Riding House Street, Fitzrovia, London</a>\n </h2>\n </header>\n \n <span class="property-style-tenure"></span>\n <div class="property-details">\n\n \n <div class="property-price">\n <div class="property-style-tenure"><span></span></div>\xa3425<small>/pw</small>\n <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n </div>\n \n \n <div class="property-features">\n <div class="property-feature">\n <div class="property-living_rooms">\n <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n 1 Reception </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bedrooms">\n <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n 1 Bedroom </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bathrooms">\n <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n 1 Bathroom </div>\n </div>\n 
</div><!-- /.property-features -->\n\n\n <div class="property-media">\n <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_3453_1_large-724x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n \n <span class="separator">|</span>\n <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3453/MED_3453_6286.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n </div><!-- /.property-media -->\n\n <div class="property-read-more">\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" class="btn btn-sm lighter-dark-primary-color">\n View Details\n </a>\n </div>\n </div><!-- /.property-details -->\n </div><!-- /.property-content -->\n </section>\n <section class="property-card new-post">\n <div class="property-thumb">\n <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" title="Grays Inn Road, Bloomsbury, London">\n <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_3933_1_large.jpg" alt="Grays Inn Road, Bloomsbury, London thumbnail">\n\n </a>\n </div><!-- /.property-thumb -->\n\n <div class="property-content">\n <header class="property-title">\n <h2>\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/">Grays Inn Road, Bloomsbury, London</a>\n </h2>\n </header>\n \n <span class="property-style-tenure"></span>\n <div class="property-details">\n\n \n <div class="property-price">\n <div class="property-style-tenure"><span></span></div>\xa3430<small>/pw</small>\n <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n </div>\n \n \n <div class="property-features">\n <div class="property-feature">\n <div class="property-living_rooms">\n <span class="esf-icon esf-32 
esf-icon-living_rooms"></span>\n 1 Reception </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bedrooms">\n <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n 1 Bedroom </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bathrooms">\n <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n 1 Bathroom </div>\n </div>\n </div><!-- /.property-features -->\n\n\n <div class="property-media">\n \n <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3933/MED_3933_5539.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n </div><!-- /.property-media -->\n\n <div class="property-read-more">\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" class="btn btn-sm lighter-dark-primary-color">\n View Details\n </a>\n </div>\n </div><!-- /.property-details -->\n </div><!-- /.property-content -->\n </section>\n ', u'wpp_query': {u'starting_row': 10, u'pagination': u'off', u'show_layout_toggle': False, u'current_page': u'2', u'requested_page': u'2', u'show_children': u'true', u'sortable_attrs': {u'menu_order': u'Default'}, u'sort_by': u'price_rent', u'sort_order': u'ASC', u'ajax_call': True, u'template': u'ajax', u'per_page': u'10', u'query': {u'sort_by': u'price_rent', u'pagi': u'10--10', u'listing_type': u'rent', u'sort_order': u'ASC', u'property_category': u'residential'}, u'sorter': u'', u'disable_wrapper': u'true', u'properties': {u'total': 60, u'results': [u'793240', u'836654', u'793035', u'793044', u'793078', u'793307', u'792965', u'793054', u'792811', u'793344']}}}
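The extra backslashes are there to escape the quotes (and newlines, etc.) inside the JSON string. Once you json.loads() the content, the backslashes are gone, so in your case call loads on the response body: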
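import json
from scrapy import FormRequest
request = FormRequest(url, formdata=data)
# In the Scrapy shell, fetch() returns None and updates the global `response`.
fetch(request)
js = json.loads(response.body)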
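To see where the backslashes come from without hitting the site, here is a small sketch that uses only the standard library. The HTML fragment and the variable names raw_body and decoded are made up for illustration; they are not the real response.

```python
import json

# A made-up HTML fragment standing in for the server's "display" value.
html = '<div class="property-thumb">\n  <a href="/example/">Example</a>\n</div>'

# Roughly what travels over the wire: a JSON document in which the quotes
# and newlines inside the string value are escaped with backslashes.
raw_body = json.dumps({"display": html})
print(raw_body)

# Decoding the JSON undoes the escaping and restores the original markup.
decoded = json.loads(raw_body)["display"]
print(decoded == html)
```

Running an XPath directly against something like raw_body means parsing the escaped text as if it were HTML, which is why the class attribute ends up with backslashes and quotes in it.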
To get the HTML, use the "display" key: html = js["display"].