How do I scrape the /html/head/script fields?

Time: 2018-12-21 15:22:51

Tags: css xpath scrapy

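I'm new to programming and scraping. Is there a way to scrape these fields directly, rather than loading the whole page and picking it apart?

Example: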

> <script> window.initialState =
> {"ACCOUNT":{"type":"PRODUCTUNIQUE","universe":"Woman","sku":"M1286ZTDT_M884_TU","code":"M1286ZTDT_M884","price":{"value":2950,"currency":"USD"},"status":"NOTFORSALE","eReservation":false,"hasSizeGuide":false,"tracking":[{"events":["addToCart"],"addToCartType":"regular","pageType":"CDC_ProductPage","ecommerce":{"currencyCode":"USD","add":{"products":{"id":"M1286ZTDT_M884_TU","name":"dior
> book tote toile de jouy bag","price":2950,"brand":"Dior Book
> Tote","category":"women/handbags/shopping bags/dior book
> tote","variant":"Multi-coloured","quantity":1,"dimension16":"M1286ZTDT_M884","dimension32":"not
> engraved"}}}}]},{"type":"PRODUCTSECTIONDESCRIPTION","sections":[{"title":"THE
> DESCRIPTION","content":"Dior Book Tote bag in canvas embroidered with
> a multi-coloured Toile de Jouy motif.<br /><br />Reference :
> M1286ZTDT_M884","type":"TEXT"},{"title":"THE
> CHARACTERISTICS","content":"Carried in the hand or on the shoulder <br
> />\nDimensions: 41.5 x 32 x 5
> cm","type":"TEXT"}]},{"type":"PRODUCTDECLINATIONS","declinations":[{"title":"Dior
> Book Tote Toile de Jouy
> bag","color":"Blue","colorCode":"33","uri":"/couture/en_us/horizon/products/couture-M1286ZTDT_M928_TU-dior-book-tote-toile-de-jouy-bag","image":{"target":"DESKTOP","uri":"https://wwws.dior.com/couture/ecommerce/media/catalog/product/cache/1/grid_image_1/460x497/17f82f742ffe127f42dca9de82fb58b1/M/1/1540309423_M1286ZTDT_M928_E01_GH.jpg","width":460,"height":497,"alt":"Click
> here to enlarge the product picture Dior Book Tote Toile de Jouy
> bag"}},{"title":"Dior Book Tote Toile de Jouy
> bag","color":"Burgundy","colorCode":"44","uri":"/couture/en_us/horizon/products/couture-M1286ZTDT_M974_TU-dior-book-tote-toile-de-jouy-bag","image":
> <a...... </script>

================================================ =======================

1 answer:

Answer 0 (score: 0)

You can target these script elements the same way you target any other element, for example with XPath or CSS selectors:

script_text = response.xpath("//script[contains(., 'window.initialState')]").extract_first()

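Then, to pull useful data out of the script text, there are several approaches. A common one is to use a regular expression to extract the text of the object you need (an array or an object/dictionary) from the script, and then load it into a Python data structure with json.loads().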
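As a rough illustration of the regex-plus-JSON idea, here is a minimal sketch using only the standard library. It assumes the value assigned to window.initialState is valid JSON (as it appears to be in the question) and uses a shortened, made-up `script_text` in place of the real page. The helper name `extract_initial_state` is hypothetical. Rather than capturing the whole object with one regex, which tends to break on nested braces, the regex only locates the assignment and `json.JSONDecoder.raw_decode` parses the value that follows.

```python
import json
import re

# Hypothetical, shortened stand-in for the string returned by extract_first().
script_text = (
    '<script> window.initialState = '
    '{"price": {"value": 2950, "currency": "USD"}, "status": "NOTFORSALE"};'
    ' </script>'
)


def extract_initial_state(text):
    """Return the value assigned to window.initialState, or None if absent."""
    # Locate the assignment; match.end() is the index of the first JSON character.
    match = re.search(r"window\.initialState\s*=\s*", text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (";", "</script>", more JavaScript).
    obj, _end = json.JSONDecoder().raw_decode(text, match.end())
    return obj


state = extract_initial_state(script_text)
print(state["price"]["value"], state["price"]["currency"])
```

If the assigned value were a JavaScript object literal that is not valid JSON (unquoted keys, trailing commas), `raw_decode` would raise a `json.JSONDecodeError`, and a JS parser would be the safer route.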

Another approach is to use a JS parser such as slimit, which gives you an AST-like interface over the JavaScript code. Here is a working example of using slimit.