如何使用Scrapy从javascript中提取jsonObj

时间:2016-10-06 12:00:35

标签: javascript json scrapy

我想构建一个jsonObj的字典。这是我到目前为止所拥有的。我还没弄明白如何提取json来解析它。

    def parse_store(self, response):
    jsonobj = response.xpath('//script[@window.appData//text').extract()
    stores = json.loads(jsonobj.body_as_unicode())
    print(stores)
    for stores in response:
        stores = {}
        stores['stores'] = response['stores']
        stores['stores']['id'] = response['stores']['id']
        stores['stores']['name'] = response['stores']['name']
        stores['stores']['addr1'] = response['stores']['addr1']
        stores['stores']['city'] = response['stores']['city']
        stores['stores']['state'] = response['stores']['state']
        stores['stores']['country'] = response['stores']['country']
        stores['stores']['zipCode'] = response['stores']['zipCode']
        stores['stores']['phone'] = response['stores']['phone']
        stores['stores']['latitude'] = response['stores']['latitude']
        stores['stores']['longitude'] = response['stores']['longitude']
        stores['stores']['services'] = response['stores']['services']
    print(stores)

    return stores

1 个答案:

答案 0 :(得分:1)

一种方法是使用js2xml(免责声明:我写了js2xml)

因此,让我们假设你有一个带有<script>元素的scrapy Selector和一些JavaScript数据:

>>> import scrapy
>>> html = '''<script>
... window.appData = {
...     "stores": [
...     {   "id": "952",
...         "name": "BAYTOWN TX",
...         "addr1": "4620 garth rd",
...         "city": "baytown",
...         "state": "TX",
...         "country": "US",
...         "zipCode": "77521",
...         "phone": "281-420-0079",
...         "locationType": "Store",
...         "locationSubType": "Big Box Store",
...         "latitude": "29.77313",
...         "longitude": "-94.97634"
...     }]
... }
... </script>'''
>>> selector = scrapy.Selector(text=html, type="html")

让我们从中提取JavaScript位:

>>> js = selector.xpath('//script/text()').extract_first()
>>> js
u'\nwindow.appData = {\n    "stores": [\n    {   "id": "952",\n        "name": "BAYTOWN TX",\n        "addr1": "4620 garth rd",\n        "city": "baytown",\n        "state": "TX",\n        "country": "US",\n        "zipCode": "77521",\n        "phone": "281-420-0079",\n        "locationType": "Store",\n        "locationSubType": "Big Box Store",\n        "latitude": "29.77313",\n        "longitude": "-94.97634"\n    }]\n}\n'

现在,导入js2xml并调用.parse()函数。你得到一个lxml树,代表JavaScript代码(它的AST的类别):

>>> import js2xml
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7fc7f1ba3bd8>

如果你有点好奇,这就是树的样子:

>>> print(js2xml.pretty_print(jstree))
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object>
          <identifier name="window"/>
        </object>
        <property>
          <identifier name="appData"/>
        </property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores">
          <array>
            <object>
              <property name="id">
                <string>952</string>
              </property>
              <property name="name">
                <string>BAYTOWN TX</string>
              </property>
              <property name="addr1">
                <string>4620 garth rd</string>
              </property>
              <property name="city">
                <string>baytown</string>
              </property>
              <property name="state">
                <string>TX</string>
              </property>
              <property name="country">
                <string>US</string>
              </property>
              <property name="zipCode">
                <string>77521</string>
              </property>
              <property name="phone">
                <string>281-420-0079</string>
              </property>
              <property name="locationType">
                <string>Store</string>
              </property>
              <property name="locationSubType">
                <string>Big Box Store</string>
              </property>
              <property name="latitude">
                <string>29.77313</string>
              </property>
              <property name="longitude">
                <string>-94.97634</string>
              </property>
            </object>
          </array>
        </property>
      </object>
    </right>
  </assign>
</program>

然后,您希望获得window.appData(一个JavaScript对象)的正确部分。 您可以使用常规XPath调用来选择:

>>> jstree.xpath('''
...     //assign[left//identifier[@name="appData"]]
...         /right
...             /*
...     ''')
[<Element object at 0x7fc7f257f5f0>]
>>> 

(即您需要<assign>节点,在<left>部分进行过滤,并获取<right>部分的子项,即<object>

js2xml有帮助器将<object>节点转换为Python dicts和列表(我们用[0]选择xpath()调用的第一个结果):

>>> js2xml.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0])
>>> from pprint import pprint
>>> pprint(js2xml.jsonlike.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]))
{'stores': [{'addr1': '4620 garth rd',
             'city': 'baytown',
             'country': 'US',
             'id': '952',
             'latitude': '29.77313',
             'locationSubType': 'Big Box Store',
             'locationType': 'Store',
             'longitude': '-94.97634',
             'name': 'BAYTOWN TX',
             'phone': '281-420-0079',
             'state': 'TX',
             'zipCode': '77521'}]}
>>>