Question

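I'm a beginner, so as far as I know BeautifulSoup only extracts data that sits inside tags (using functions such as get, find, find_all, and so on).

The source code of the site I'm scraping shows several items inside the same tag.

This is what an 'item' looks like in the source (what bothers me is that the entries are separated only by commas):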

{
                    "id1" : "121130815",
                    "id2" : "113840",

                }

How can I get at this item?

Thanks a lot.

Answer 1

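This is not HTML; it is JSON. BeautifulSoup is a library for parsing HTML markup, not a tool for every kind of content a website can return. What a web page contains can come in many different formats, depending on how you define a page.

In this case the site is returning JSON, so you need to pick the right tool: json.loads() from Python's built-in json library. You can read more about the json module in the Python documentation.

You should also read a little about JSON itself, since you are not familiar with it. A short introduction to the format is worth the time.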
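
To tie this back to the question: if the text you pulled out of the tag (for example with BeautifulSoup's get_text()) contains the object, one rough way to parse it is to slice out everything between the first '{' and the last '}' and hand that to json.loads. This is only a minimal sketch. The contents of tag_text are a hypothetical stand-in for what the page holds, extract_json_object is a made-up helper name, and the trailing comma from the question's excerpt is left out because strict JSON does not allow one.

```python
import json

# Hypothetical text taken from inside a tag, e.g. the result of tag.get_text().
# The "var item = ...;" wrapper is an assumption about how the page embeds it.
tag_text = '''
    var item = {
        "id1" : "121130815",
        "id2" : "113840"
    };
'''


def extract_json_object(text):
    """Parse the JSON object found between the first '{' and the last '}'."""
    start = text.index("{")      # raises ValueError if there is no '{'
    end = text.rindex("}") + 1   # include the closing brace in the slice
    return json.loads(text[start:end])


item = extract_json_object(tag_text)
print(item["id1"])
print(item["id2"])
```

This approach only holds up when the tag contains a single top-level object; with several objects in one tag, the slice would span all of them and json.loads would reject it.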

Answer 2

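As the other answer and I have mentioned, json is the place to start for data like this that is not wrapped in HTML tags.

First, parse your string...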

import json

json_str = """{
                    "idannonce" : "121130815",
                    "idagence" : "113840",
                    "idtiers" : "169816",
                    "typedebien" : "Appartement",
                    "typedetransaction" : ["vente"],
                    "idtypepublicationsourcecouplage" : "SL",

                    ...

                    "si_sdEau" : "0",
                    "nb_photos" : "6",
                    "prix" : "745000",
                    "surface" : "76"
                }"""

json_data = json.loads(json_str)

print(json_data)

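The important part is the json.loads function, which does all the heavy lifting of decoding the JSON string into actual Python objects.

From that, we get a dict object that looks like this: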

{'si_balcon': '1', 'affichagetype': [{'name': 'list', 'value': True}], 'codepostal': '75016', 'typedetransaction': ['vente'], 'naturebien': '1', 'etage': '1', 'position': '0', 'idtypechauffage': 'central', 'idtypecuisine': 'séparée', 'nb_photos': '6', 'prix': '745000', 'nb_pieces': '3', 'idtypecommerce': '0', 'idtypepublicationsourcecouplage': 'SL', 'si_sdEau': '0', 'codeinsee': '750116', 'cp': '75016', 'nb_chambres': '2', 'idagence': '113840', 'si_sdbain': '1', 'typedebien': 'Appartement', 'idannonce': '121130815', 'produitsvisibilite': 'AD:AC:AG:BB:AW', 'surface': '76', 'idtiers': '169816'}
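
Note that in the dict above the numbers are still strings ('745000', '76'), because that is how they are quoted in the JSON, while the JSON array has become a Python list. A minimal sketch of converting them, using only a handful of the fields shown above (the shortened json_str and the price_per_m2 name are just for illustration):

```python
import json

# A shortened copy of the listing, limited to fields that appear above.
json_str = """{
    "idannonce": "121130815",
    "typedebien": "Appartement",
    "typedetransaction": ["vente"],
    "prix": "745000",
    "surface": "76"
}"""

data = json.loads(json_str)

# Quoted numbers arrive as str, so convert them before doing arithmetic.
price = int(data["prix"])
surface = int(data["surface"])
price_per_m2 = price / surface

# JSON arrays are decoded as Python lists.
transactions = data["typedetransaction"]

print(type(data).__name__, type(transactions).__name__)
print(round(price_per_m2, 2))
```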

Now you can get at all of the data by iterating over the dict in a loop, like this:

for key in json_data:
    print(key, ':', json_data[key])

Which prints:

si_balcon : 1
affichagetype : [{'name': 'list', 'value': True}]
codepostal : 75016
typedetransaction : ['vente']
naturebien : 1

...

produitsvisibilite : AD:AC:AG:BB:AW
surface : 76
idtiers : 169816

And so on. You can access any element you need simply with json_data[someKey].
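
A few access patterns that tend to come up, sketched on a small dict copied from the output above. The key "no_such_key" and the fallback value "n/a" are made up for the example.

```python
# A few entries copied from the decoded dict shown earlier.
json_data = {
    "affichagetype": [{"name": "list", "value": True}],
    "typedetransaction": ["vente"],
    "prix": "745000",
}

# Plain key lookup; raises KeyError if the key is missing.
price = json_data["prix"]

# dict.get returns a fallback instead of raising for a missing key.
missing = json_data.get("no_such_key", "n/a")

# Nested values are reached by chaining indexes: list item, then dict key.
display_name = json_data["affichagetype"][0]["name"]

# items() yields key and value together, avoiding a second lookup.
for key, value in json_data.items():
    print(key, ":", value)
```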
