Scrapy response converts backslashes into element attributes

Time: 2016-09-14 23:16:04

Tags: python python-2.7 xpath web-scraping scrapy

I am running the following code in the Scrapy shell, using a POST request to scrape data:

url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'

data = {'action': 'wpp_property_overview_pagination',
        'wpp_ajax_query[show_children]': 'true',
        'wpp_ajax_query[disable_wrapper]': 'true',
        'wpp_ajax_query[pagination]': 'off',
        'wpp_ajax_query[per_page]': '10',
        'wpp_ajax_query[query][property_category]': 'residential',
        'wpp_ajax_query[query][listing_type]': 'rent',
        'wpp_ajax_query[query][sort_by]': 'price_rent',
        'wpp_ajax_query[query][sort_order]': 'ASC',
        'wpp_ajax_query[query][pagi]': '0--10',
        'wpp_ajax_query[sorter]': '',
        'wpp_ajax_query[sort_by]': 'price_rent',
        'wpp_ajax_query[sort_order]': 'ASC',
        'wpp_ajax_query[template]': 'ajax',
        'wpp_ajax_query[requested_page]': '2'}

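from scrapy.http import FormRequest  # not available by default in the Scrapy shell

request = FormRequest(url, formdata=data)
fetch(request)  # fetch() is a Scrapy shell helper that updates `response`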

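I know the response contains elements with the class "property-thumb": I inspected it with Chrome DevTools and read the response content. So I tried to scrape the data with the XPath //*[@class="property-thumb"]. The XPath is correct (I checked it against the content loaded into the page with a Chrome plugin), but it matches nothing when I use it from the Scrapy shell: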

In [10]: response.xpath('//*[@class="property-thumb"]')
Out[10]: []

I noticed that response.body contains a lot of backslashes, and found that the XPath that actually matches is //*[@class=\'\\"property-thumb\\"\']

In [11]: response.xpath('//*[@class=\'\\"property-thumb\\"\']')
Out[11]: 
[<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n      '>,
 <Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n      '>,
 <Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n      '>]

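I think there is a problem with the way Scrapy handles the response string, and I expect these backslashes to cause more problems while scraping. Why does this happen, and how can I fix it so that the normal XPath works?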

1 answer:

Answer 0: (score: 1)

There is a very simple solution: you are getting JSON back, not HTML.

url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'

data = {'action': 'wpp_property_overview_pagination',
        'wpp_ajax_query[show_children]': 'true',
        'wpp_ajax_query[disable_wrapper]': 'true',
        'wpp_ajax_query[pagination]': 'off',
        'wpp_ajax_query[per_page]': '10',
        'wpp_ajax_query[query][property_category]': 'residential',
        'wpp_ajax_query[query][listing_type]': 'rent',
        'wpp_ajax_query[query][sort_by]': 'price_rent',
        'wpp_ajax_query[query][sort_order]': 'ASC',
        'wpp_ajax_query[query][pagi]': '0--10',
        'wpp_ajax_query[sorter]': '',
        'wpp_ajax_query[sort_by]': 'price_rent',
        'wpp_ajax_query[sort_order]': 'ASC',
        'wpp_ajax_query[template]': 'ajax',
        'wpp_ajax_query[requested_page]': '2'}
import requests

# The endpoint returns JSON, so decode it instead of parsing it as HTML.
print(requests.post(url, data=data).json())

Which gives you:

{u'display': u'        <section class="property-card new-post">\n            <div class="property-thumb">\n                <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" title="Lisson Street, Marylebone, London">\n                    <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_4427_6_large.jpg" alt="Lisson Street, Marylebone, London thumbnail">\n\n                                    </a>\n            </div><!-- /.property-thumb -->\n\n            <div class="property-content">\n                                    <header class="property-title">\n                        <h2>\n                            <a  href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/">Lisson Street, Marylebone, London</a>\n                        </h2>\n                    </header>\n                \n                <span class="property-style-tenure"></span>\n                <div class="property-details">\n\n                    \n                                                    <div class="property-price">\n                                <div class="property-style-tenure"><span></span></div>\xa3420<small>/pw</small>\n                                                                    <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n                                                            </div>\n                        \n                    \n                    <div class="property-features">\n                                                    <div class="property-feature">\n                                <div class="property-living_rooms">\n                                    <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n                                    1                                    
Reception                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bedrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n                                    1                                    Bedroom                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bathrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n                                    1                                    Bathroom                                </div>\n                            </div>\n                                            </div><!-- /.property-features -->\n\n\n                        <div class="property-media">\n                                                              <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_4427_1_large-743x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n      \n                                                                                                <span class="separator">|</span>\n                                                                <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/4427/MED_4427_6235.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n                                                    </div><!-- /.property-media -->\n\n                    <div class="property-read-more">\n                        <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" class="btn btn-sm lighter-dark-primary-color">\n 
                           View Details\n                        </a>\n                    </div>\n                </div><!-- /.property-details -->\n            </div><!-- /.property-content -->\n        </section>\n            <section class="property-card new-post">\n            <div class="property-thumb">\n                <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" title="Riding House Street, Fitzrovia, London">\n                    <img src="http://www.ldg.co.uk/wp-content/uploads/2016/09/IMG_3453_10_large.jpg" alt="Riding House Street, Fitzrovia, London thumbnail">\n\n                                    </a>\n            </div><!-- /.property-thumb -->\n\n            <div class="property-content">\n                                    <header class="property-title">\n                        <h2>\n                            <a  href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/">Riding House Street, Fitzrovia, London</a>\n                        </h2>\n                    </header>\n                \n                <span class="property-style-tenure"></span>\n                <div class="property-details">\n\n                    \n                                                    <div class="property-price">\n                                <div class="property-style-tenure"><span></span></div>\xa3425<small>/pw</small>\n                                                                    <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n                                                            </div>\n                        \n                    \n                    <div class="property-features">\n                                                    <div class="property-feature">\n               
                 <div class="property-living_rooms">\n                                    <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n                                    1                                    Reception                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bedrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n                                    1                                    Bedroom                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bathrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n                                    1                                    Bathroom                                </div>\n                            </div>\n                                            </div><!-- /.property-features -->\n\n\n                        <div class="property-media">\n                                                              <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_3453_1_large-724x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n      \n                                                                                                <span class="separator">|</span>\n                                                                <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3453/MED_3453_6286.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n                                                    </div><!-- /.property-media -->\n\n                    
<div class="property-read-more">\n                        <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" class="btn btn-sm lighter-dark-primary-color">\n                            View Details\n                        </a>\n                    </div>\n                </div><!-- /.property-details -->\n            </div><!-- /.property-content -->\n        </section>\n            <section class="property-card new-post">\n            <div class="property-thumb">\n                <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" title="Grays Inn Road, Bloomsbury, London">\n                    <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_3933_1_large.jpg" alt="Grays Inn Road, Bloomsbury, London thumbnail">\n\n                                    </a>\n            </div><!-- /.property-thumb -->\n\n            <div class="property-content">\n                                    <header class="property-title">\n                        <h2>\n                            <a  href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/">Grays Inn Road, Bloomsbury, London</a>\n                        </h2>\n                    </header>\n                \n                <span class="property-style-tenure"></span>\n                <div class="property-details">\n\n                    \n                                                    <div class="property-price">\n                                <div class="property-style-tenure"><span></span></div>\xa3430<small>/pw</small>\n                                                                    <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n                                                            
</div>\n                        \n                    \n                    <div class="property-features">\n                                                    <div class="property-feature">\n                                <div class="property-living_rooms">\n                                    <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n                                    1                                    Reception                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bedrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n                                    1                                    Bedroom                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bathrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n                                    1                                    Bathroom                                </div>\n                            </div>\n                                            </div><!-- /.property-features -->\n\n\n                        <div class="property-media">\n                                                        \n                                                                                            <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3933/MED_3933_5539.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n                                                    </div><!-- /.property-media -->\n\n                    <div class="property-read-more">\n                        <a 
href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" class="btn btn-sm lighter-dark-primary-color">\n                            View Details\n                        </a>\n                    </div>\n                </div><!-- /.property-details -->\n            </div><!-- /.property-content -->\n        </section>\n    ', u'wpp_query': {u'starting_row': 10, u'pagination': u'off', u'show_layout_toggle': False, u'current_page': u'2', u'requested_page': u'2', u'show_children': u'true', u'sortable_attrs': {u'menu_order': u'Default'}, u'sort_by': u'price_rent', u'sort_order': u'ASC', u'ajax_call': True, u'template': u'ajax', u'per_page': u'10', u'query': {u'sort_by': u'price_rent', u'pagi': u'10--10', u'listing_type': u'rent', u'sort_order': u'ASC', u'property_category': u'residential'}, u'sorter': u'', u'disable_wrapper': u'true', u'properties': {u'total': 60, u'results': [u'793240', u'836654', u'793035', u'793044', u'793078', u'793307', u'792965', u'793054', u'792811', u'793344']}}}

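The extra backslashes are there to escape the quotes and so on inside the JSON string. Once you json.loads() the content, the backslashes go away, so in your case call loads on the response body: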

 import json

 request = FormRequest(url, formdata=data)
 fetch(request)  # fetch() returns None; it updates the shell's `response`
 js = json.loads(response.body)
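The escaping can be reproduced with the standard library alone. This is only a minimal sketch: the HTML fragment below is made up and stands in for the real "display" value, and the names html, raw_body and decoded are illustrative.

```python
import json

# Made-up fragment standing in for the real "display" value.
html = '<div class="property-thumb">\n  <a href="/example">Example</a>\n</div>'

# Roughly what the server sends: the HTML wrapped in a JSON string, so
# every double quote is written as \" and every newline as \n.
raw_body = json.dumps({"display": html})
print(raw_body)

# Decoding the JSON restores the original markup without backslashes.
decoded = json.loads(raw_body)["display"]
print(decoded == html)
```

Parsing raw_body directly as HTML is what makes the class attribute appear to contain backslashes and quotes.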

To get the HTML, use the "display" key: html = js["display"]