I want to build a dictionary from jsonObj. This is what I have so far. I haven't figured out how to extract the JSON so that I can parse it.
import json

def parse_store(self, response):
    # Text of the <script> element that assigns window.appData
    jsonobj = response.xpath('//script[contains(text(), "window.appData")]/text()').extract_first()
    # Unsolved step: jsonobj is a JavaScript assignment, not bare JSON,
    # so json.loads() raises an error here
    stores = json.loads(jsonobj)
    print(stores)
    for store in stores['stores']:
        # Copy the fields of interest into one dict per store
        item = {
            'id': store['id'],
            'name': store['name'],
            'addr1': store['addr1'],
            'city': store['city'],
            'state': store['state'],
            'country': store['country'],
            'zipCode': store['zipCode'],
            'phone': store['phone'],
            'latitude': store['latitude'],
            'longitude': store['longitude'],
            'services': store['services'],
        }
        print(item)
        yield item
Answer 0 (score: 1)
One way is to use js2xml (disclaimer: I wrote js2xml).
So, let's assume you have a scrapy Selector with a <script> element holding some JavaScript data:
>>> import scrapy
>>> html = '''<script>
... window.appData = {
... "stores": [
... { "id": "952",
... "name": "BAYTOWN TX",
... "addr1": "4620 garth rd",
... "city": "baytown",
... "state": "TX",
... "country": "US",
... "zipCode": "77521",
... "phone": "281-420-0079",
... "locationType": "Store",
... "locationSubType": "Big Box Store",
... "latitude": "29.77313",
... "longitude": "-94.97634"
... }]
... }
... </script>'''
>>> selector = scrapy.Selector(text=html, type="html")
Let's extract the JavaScript bit from it:
>>> js = selector.xpath('//script/text()').extract_first()
>>> js
u'\nwindow.appData = {\n "stores": [\n { "id": "952",\n "name": "BAYTOWN TX",\n "addr1": "4620 garth rd",\n "city": "baytown",\n "state": "TX",\n "country": "US",\n "zipCode": "77521",\n "phone": "281-420-0079",\n "locationType": "Store",\n "locationSubType": "Big Box Store",\n "latitude": "29.77313",\n "longitude": "-94.97634"\n }]\n}\n'
Now import js2xml and call its .parse() function. You get back an lxml tree that represents the JavaScript code (a kind of AST for it):
>>> import js2xml
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7fc7f1ba3bd8>
If you're curious, this is what the tree looks like:
>>> print(js2xml.pretty_print(jstree))
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object>
          <identifier name="window"/>
        </object>
        <property>
          <identifier name="appData"/>
        </property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores">
          <array>
            <object>
              <property name="id">
                <string>952</string>
              </property>
              <property name="name">
                <string>BAYTOWN TX</string>
              </property>
              <property name="addr1">
                <string>4620 garth rd</string>
              </property>
              <property name="city">
                <string>baytown</string>
              </property>
              <property name="state">
                <string>TX</string>
              </property>
              <property name="country">
                <string>US</string>
              </property>
              <property name="zipCode">
                <string>77521</string>
              </property>
              <property name="phone">
                <string>281-420-0079</string>
              </property>
              <property name="locationType">
                <string>Store</string>
              </property>
              <property name="locationSubType">
                <string>Big Box Store</string>
              </property>
              <property name="latitude">
                <string>29.77313</string>
              </property>
              <property name="longitude">
                <string>-94.97634</string>
              </property>
            </object>
          </array>
        </property>
      </object>
    </right>
  </assign>
</program>
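The dump above follows a simple pattern: an <object> holds <property name="..."> children, an <array> holds its items in order, and a <string> holds the literal text. As a rough illustration of how that shape maps onto Python data, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a hand-copied fragment of the tree. The function name node_to_python is made up for this example, it handles only the three node types shown here, and it is not js2xml's own conversion code (js2xml ships its own helper, used further down).

```python
import xml.etree.ElementTree as ET

# Hand-copied fragment of the tree printed above (two properties only).
FRAGMENT = """
<object>
  <property name="stores">
    <array>
      <object>
        <property name="id"><string>952</string></property>
        <property name="city"><string>baytown</string></property>
      </object>
    </array>
  </property>
</object>
"""


def node_to_python(node):
    """Convert an <object>/<array>/<string> element into dict/list/str.

    Hypothetical helper for illustration; only these three tags are handled.
    """
    if node.tag == "string":
        return node.text or ""
    if node.tag == "array":
        # Array items are the child elements, in document order.
        return [node_to_python(child) for child in node]
    if node.tag == "object":
        # Each <property name="..."> wraps exactly one value element.
        return {prop.get("name"): node_to_python(prop[0]) for prop in node}
    raise ValueError("unsupported node type: %s" % node.tag)


data = node_to_python(ET.fromstring(FRAGMENT))
print(data)
```

Numbers, booleans and nested expressions would need extra branches; they are left out here because the sample data contains only strings.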
Next, you want to get at the right-hand side of the window.appData assignment (a JavaScript object).
You can select it with a regular XPath call:
>>> jstree.xpath('''
... //assign[left//identifier[@name="appData"]]
... /right
... /*
... ''')
[<Element object at 0x7fc7f257f5f0>]
>>>
(That is, you want the <assign> node, filtered on its <left> part, and you take the child of its <right> part, which is the <object>.)
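To make the filter explicit, the same selection can be spelled out in plain Python. This is a minimal sketch on a trimmed, hand-written copy of the tree. It uses the standard library's xml.etree.ElementTree, whose XPath support is limited (hence the explicit loop), rather than the lxml tree that js2xml returns. find_assigned_value is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written copy of the tree shape shown earlier.
TREE = """
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object><identifier name="window"/></object>
        <property><identifier name="appData"/></property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores"><array/></property>
      </object>
    </right>
  </assign>
</program>
"""


def find_assigned_value(root, name):
    """Return the first child of <right> for an <assign> whose <left> mentions `name`.

    Hypothetical helper; returns None when no such assignment exists.
    """
    for assign in root.iter("assign"):
        left = assign.find("left")
        right = assign.find("right")
        if left is None or right is None:
            continue
        # Counterpart of the predicate: left//identifier[@name="appData"]
        if any(ident.get("name") == name for ident in left.iter("identifier")):
            # Counterpart of: /right/*  (first child only)
            return right[0] if len(right) else None
    return None


node = find_assigned_value(ET.fromstring(TREE), "appData")
print(node.tag)
```

With lxml the single XPath expression above does all of this in one call; the loop is only meant to show what the expression is checking.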
js2xml has helpers to convert an <object> node into Python dicts and lists (we use [0] to pick the first result of the xpath() call):
>>> appdata = js2xml.jsonlike.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0])
>>> from pprint import pprint
>>> pprint(js2xml.jsonlike.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]))
{'stores': [{'addr1': '4620 garth rd',
             'city': 'baytown',
             'country': 'US',
             'id': '952',
             'latitude': '29.77313',
             'locationSubType': 'Big Box Store',
             'locationType': 'Store',
             'longitude': '-94.97634',
             'name': 'BAYTOWN TX',
             'phone': '281-420-0079',
             'state': 'TX',
             'zipCode': '77521'}]}
>>>
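For a sample like this one, where the right-hand side of window.appData = ... happens to be strict JSON, a dependency-free alternative is also possible. The sketch below is not the js2xml approach from this answer: it assumes the object literal is valid JSON (double-quoted keys, no trailing commas, no comments) and that the script contains a single such assignment. It then reshapes the result into one flat dict per store, which is what the question was after. The names extract_app_data, store_records and WANTED_KEYS are invented for the example, and the sample string is abbreviated.

```python
import json
import re

# Abbreviated copy of the script text from the example above.
JS = '''
window.appData = {
  "stores": [
    {"id": "952", "name": "BAYTOWN TX", "city": "baytown", "state": "TX",
     "zipCode": "77521", "phone": "281-420-0079",
     "latitude": "29.77313", "longitude": "-94.97634"}
  ]
};
'''

WANTED_KEYS = ("id", "name", "city", "state", "zipCode", "phone",
               "latitude", "longitude")


def extract_app_data(js_text):
    """Parse the JSON value assigned to window.appData, or return None."""
    match = re.search(r"window\.appData\s*=\s*", js_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (for example a trailing semicolon).
    value, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return value


def store_records(app_data):
    """Build one flat dict per store with WANTED_KEYS (missing keys become None)."""
    return [{key: store.get(key) for key in WANTED_KEYS}
            for store in app_data.get("stores", [])]


records = store_records(extract_app_data(JS))
print(records[0]["name"])
```

If the literal uses unquoted keys, single quotes or other JavaScript-only syntax, json raises an error and a real JavaScript parser such as js2xml is the safer route.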