Extracting a dictionary object from a div with Beautiful Soup

Date: 2016-07-19 03:51:13

Tags: python web-scraping beautifulsoup

I have the following sample:

    [<div class="options__list">
    <a href="/link1">
    <div class="options__list__item" option-message="closed" data-option='{"id":1,"is_active":true,"name":"Fran","city":{"id":32,"name":"Paris","is_top":null,"url_key":"paris","main_area":{"id":null,"name":null,"url_key":null}}}'></div>
    </a><a href="/link2">
    <div class="options__list__item" option-message="closed" data-option='{"id":2,"is_active":true,"name":"Fran2","city":{"id":32,"name":"Paris","is_top":null,"url_key":"paris","main_area":{"id":null,"name":null,"url_key":null}}}'></div>
    </a>]

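I want to extract:

  1. the href links
  2. the "data-option" dictionary

What is the best way to do this? Also, suppose I only want to extract specific keys from the "data-option" dictionary; how would I do that?

Thanks a lot in advance.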

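1 answer:

Answer 0: (score: 3)

The idea is to iterate over the links, get each link's href attribute value, then find the inner option list item and load its data-option value into a Python dictionary with json.loads():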

import json

from bs4 import BeautifulSoup

data = """
<div>
    <div class="options__list">
        <a href="/link1">
            <div class="options__list__item" option-message="closed" data-option='{"id":1,"is_active":true,"name":"Fran","city":{"id":32,"name":"Paris","is_top":null,"url_key":"paris","main_area":{"id":null,"name":null,"url_key":null}}}'></div>
        </a>
        <a href="/link2">
            <div class="options__list__item" option-message="closed" data-option='{"id":2,"is_active":true,"name":"Fran2","city":{"id":32,"name":"Paris","is_top":null,"url_key":"paris","main_area":{"id":null,"name":null,"url_key":null}}}'></div>
        </a>
    </div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

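# Each <a> that is a direct child of the .options__list container is one option
for link in soup.select(".options__list > a"):
    href = link['href']
    # data-option holds a JSON string; json.loads() turns it into a dict
    data_option = json.loads(link.select_one("div.options__list__item")["data-option"])

    print(href, data_option['id'])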

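This prints the href value and the option id for demonstration purposes. The output below is from Python 2, where print(href, data_option['id']) prints a tuple; under Python 3 the same line prints /link1 1 and /link2 2: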

(u'/link1', 1)
(u'/link2', 2)
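
As for pulling out only specific keys: once json.loads() has run, data_option is an ordinary Python dict, so normal dict operations apply. Here is a minimal sketch that uses only the json module on the attribute string from the first item in the sample. The `wanted` tuple and the variable names are illustrative assumptions, not something from the original answer.

```python
import json

# The raw data-option attribute value of the first item in the sample.
raw = '{"id":1,"is_active":true,"name":"Fran","city":{"id":32,"name":"Paris","is_top":null,"url_key":"paris","main_area":{"id":null,"name":null,"url_key":null}}}'

data_option = json.loads(raw)

# Hypothetical choice of keys; swap in whichever ones you need.
wanted = ("id", "name")

# Keep only the wanted top-level keys, skipping any that are absent.
subset = {key: data_option[key] for key in wanted if key in data_option}

# Nested values are reached by chaining lookups. The "or {}" fallback
# covers both a missing "city" key and a "city" that is JSON null.
city_name = (data_option.get("city") or {}).get("name")

print(subset, city_name)
```

Inside the loop from the answer, the same two lookups would work on each data_option in turn. Whether to skip missing keys silently (as here) or let a KeyError surface depends on how strict you want the scraper to be.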