Scrapy: extract the value of a tag attribute

Time: 2017-01-31 15:48:36

Tags: json web-scraping scrapy

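I am working with a JSON request whose values contain HTML (div markup). Now I want to get only the value of data-content-value from an element like this: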
<li id="term_100800962"  data-content-value='{"nl_term_id":100800962,"c_price_from":33415,"nd_price_discount":0,"nl_tour_id":1017864,"nl_hotel_id":[49316],"d_start":"2017-04-12","d_end":"2017-04-17"}' >

and store it as 'date', 'id', and 'price', but I can't find a way to do that.

Is there a simple way?

1 answer:

Answer 0: (score: 3)


First, get the attribute as a string, then use json.loads() to convert it into a Python dict:

In [2]: from scrapy.selector import Selector

In [3]: text = """<li id="term_100800962" data-content-value='{"nl_term_id":100800962,"c_price_from":33415,"nd_price_discount":0,"nl_tour_id":1017864,"nl_hotel_id":[49316],"d_start":"2017-04-12","d_end":"2017-04-17"}' >"""

In [4]: sel = Selector(text=text)

In [5]: data_string = sel.xpath('//li/@data-content-value').extract_first()

In [6]: import json

In [7]: json.loads(data_string)
Out[7]:
{'c_price_from': 33415,
 'd_end': '2017-04-17',
 'd_start': '2017-04-12',
 'nd_price_discount': 0,
 'nl_hotel_id': [49316],
 'nl_term_id': 100800962,
 'nl_tour_id': 1017864}
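To store the result as 'date', 'id', and 'price', the parsed dict can be reduced to just those fields. The sketch below is only an illustration: the helper name to_item is hypothetical, and the mapping (d_start as the date, nl_term_id as the id, c_price_from as the price) is an assumption, since the question does not say which keys it means.

```python
import json

# The attribute string, copied from the <li> element in the question.
data_string = (
    '{"nl_term_id":100800962,"c_price_from":33415,"nd_price_discount":0,'
    '"nl_tour_id":1017864,"nl_hotel_id":[49316],'
    '"d_start":"2017-04-12","d_end":"2017-04-17"}'
)


def to_item(raw):
    """Parse one data-content-value string and keep only the wanted fields."""
    data = json.loads(raw)
    # Assumed mapping: the question does not define 'date', 'id' or 'price'.
    return {
        "date": data["d_start"],
        "id": data["nl_term_id"],
        "price": data["c_price_from"],
    }


item = to_item(data_string)
print(item)
```

If the end date or the tour id is what you actually need, swap in data["d_end"] or data["nl_tour_id"] instead.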

This URL returns a JSON response, so we should load the whole response as JSON and then select the information we need:

In [11]: fetch('https://dovolena.invia.cz/direct/tour_search/ajax-next-boxes/?nl
...: _country_id%5B0%5D=28&nl_locality_id%5B0%5D=19&d_start_from=23.01.2017&
...: d_end_to=19.04.2017&nl_transportation_id%5B0%5D=3&sort=nl_sell&page=1&g
...: etOptionsCount=true&base_url=https%3A%2F%2Fdovolena.invia.cz%2F')

In [12]: j = json.loads(response.text)
In [13]: j['boxes_html']  # this returns the HTML string stored in the JSON response
In [15]: from scrapy.selector import Selector

In [16]: sel = Selector(text=j['boxes_html'])  # loads html to selector

In [17]: datas = sel.xpath('//li/@data-content-value').extract()  # returns all attribute values as a list of strings
In [21]: [json.loads(d) for d in datas]  # parses each JSON string into a dict
# This returns a list of dicts, one per json.loads(d) call; use json.loads(d)['d_end'] to access an individual field.
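
Outside a Scrapy shell, the same steps (load the JSON, pull out the embedded HTML, collect every data-content-value, parse each one) can be sketched with the standard library alone. Note that html.parser.HTMLParser is swapped in here for Scrapy's Selector and XPath; the class name ContentValueParser, the function parse_boxes, and the sample response are all hypothetical, and only the boxes_html key and the attribute name come from the session above.

```python
import json
from html.parser import HTMLParser


class ContentValueParser(HTMLParser):
    """Collect the data-content-value attribute of every <li> element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "li":
            for name, value in attrs:
                if name == "data-content-value" and value is not None:
                    self.values.append(value)


def parse_boxes(response_text):
    """Return one dict per <li data-content-value=...> found in boxes_html."""
    payload = json.loads(response_text)   # the whole response is JSON
    parser = ContentValueParser()
    parser.feed(payload["boxes_html"])    # the HTML is a string inside the JSON
    parser.close()
    return [json.loads(value) for value in parser.values]


# Hypothetical sample response; the second <li> is made-up test data.
sample_html = """<ul>
<li id="term_100800962" data-content-value='{"nl_term_id":100800962,"d_end":"2017-04-17"}'></li>
<li id="term_1" data-content-value='{"nl_term_id":1,"d_end":"2017-05-01"}'></li>
<li class="no-data"></li>
</ul>"""
sample_response = json.dumps({"boxes_html": sample_html})

items = parse_boxes(sample_response)
for entry in items:
    print(entry["nl_term_id"], entry["d_end"])
```

Inside a Scrapy project the Selector-based version shown in the session is the more natural choice; this variant is mainly useful for checking the parsing logic on saved responses.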