How do I scrape a number from an API website?

Time: 2018-12-04 09:19:59

Tags: python json api web-scraping python-requests

I'm trying to scrape a value from a website. The value changes constantly, and I want to get the current one.

I tried:

import json
import requests

my_url = requests.get('https://www.telekom.hu/shop/categoryresults/https://www.telekom.hu/shop/categoryresults/?N=10994&contractType=list_price&instock_products=1&Ns=sku.sortingPrice%7C0%7C%7Cproduct.displayName%7C0&No=0&Nrpp=9&paymentType=FULL')

data = my_url.text
parsed = json.loads(data)
my_number = parsed["totalNumRecs"]
print(my_number)

But this fails with an error (a KeyError on "totalNumRecs").

What am I doing wrong? Why can't I retrieve the number stored in totalNumRecs?

2 answers:

Answer 0: (score: 1)

You get the KeyError because the returned dictionary is nested. totalNumRecs does exist, but not at the top level of the dict. Take a look at:

Find all occurrences of a key in nested python dictionaries and lists

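That question shows a way to walk a dictionary of unknown structure and find every occurrence of a specific key. With the following code, inspired by the linked question, I was able to find the desired key and its value: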

import requests
import json


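def gen_dict_extract(key, var):
    """Yield every value stored under `key` anywhere in a nested dict/list structure."""
    if hasattr(var, 'items'):
        # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, dict):
                # Recurse into nested dictionaries
                for result in gen_dict_extract(key, v):
                    yield result
            elif isinstance(v, list):
                # Recurse into each element of a nested list
                for d in v:
                    for result in gen_dict_extract(key, d):
                        yield result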



my_url = requests.get('https://www.telekom.hu/shop/categoryresults/https://www.telekom.hu/shop/categoryresults/?N=10994&contractType=list_price&instock_products=1&Ns=sku.sortingPrice%7C0%7C%7Cproduct.displayName%7C0&No=0&Nrpp=9&paymentType=FULL')

data = my_url.text
parsed = json.loads(data)

result = gen_dict_extract('totalNumRecs', parsed)

for i in result:
    print(i)
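
The same recursive idea can also report where each match sits, which shows why the top-level lookup fails. The following is only a minimal sketch: `find_key_paths` is a hypothetical helper name, and `sample` is a small hand-made dictionary standing in for the parsed response (the real payload is much larger and may be shaped differently).

```python
def find_key_paths(key, var, path=()):
    """Yield (path, value) pairs for every occurrence of key in nested dicts/lists."""
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield path + (k,), v
            # Recurse into the value; non-container values yield nothing.
            for found in find_key_paths(key, v, path + (k,)):
                yield found
    elif isinstance(var, list):
        for index, item in enumerate(var):
            for found in find_key_paths(key, item, path + (index,)):
                yield found


# Hand-made stand-in for the parsed JSON (assumption: not the real response).
sample = {
    "MainContent": [
        {"contents": [{"totalNumRecs": 42, "records": []}]}
    ]
}

try:
    sample["totalNumRecs"]
except KeyError as exc:
    top_level_error = repr(exc)  # the key is not at the top level

matches = list(find_key_paths("totalNumRecs", sample))
for found_path, value in matches:
    print(found_path, value)
```

Each reported path is the sequence of keys and list indexes that would be needed to reach the value by direct indexing.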

Answer 1: (score: 1)

You need to specify the full "path" to the key you want:

import requests

my_url = requests.get('https://www.telekom.hu/shop/categoryresults/https://www.telekom.hu/shop/categoryresults/?N=10994&contractType=list_price&instock_products=1&Ns=sku.sortingPrice%7C0%7C%7Cproduct.displayName%7C0&No=0&Nrpp=9&paymentType=FULL')
data = my_url.json()  # parse the JSON body directly
my_number = data['MainContent'][0]['contents'][0]['totalNumRecs']
print(my_number)
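
A hard-coded path like this raises KeyError or IndexError as soon as the site changes its layout. One possible way to soften that is a tolerant lookup; this is only a sketch, where `get_path` is a hypothetical helper and `sample` is a hand-made dictionary used in place of the live response.

```python
def get_path(data, path, default=None):
    """Follow a sequence of dict keys / list indexes, returning default on any miss."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, out-of-range index, or a value that cannot be indexed.
            return default
    return current


# Hand-made stand-in for the parsed JSON (assumption: not the real response).
sample = {"MainContent": [{"contents": [{"totalNumRecs": 42}]}]}

path = ["MainContent", 0, "contents", 0, "totalNumRecs"]
found = get_path(sample, path)
missing = get_path(sample, ["MainContent", 1, "contents"], default=-1)
print(found, missing)
```

With the live response, the result of `my_url.json()` could be passed in place of `sample`. Calling `my_url.raise_for_status()` before parsing would also surface HTTP errors early.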