Scraping a table

Time: 2017-04-11 14:02:25

Tags: web-scraping beautifulsoup

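I am trying to extract the attributes of a house and their corresponding values. I would like to end up with {key: {Property type: Commercial property, Purchase price: CHF 475,000, ...}}.

I can extract the values one at a time, but not in a loop that updates a dictionary.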

<dl class="row xsmall-up-2 medium-up-3 large-up-4 attributes-grid">
    <div class="column">
        <dt class="label-text">
            Property type
        </dt>
        <dd>
Commercial property            </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Purchase price
        </dt>
        <dd>
CHF 475,000            </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Floor space
        </dt>
        <dd>
114 m&sup2;            </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Floor 
        </dt>
        <dd>
1. floor             </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Year of construction
        </dt>
        <dd>
1989            </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Balcony/ies
        </dt>
        <dd>
                <i class="fa fa-check text-green" aria-hidden="true"></i>
        </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Indoor parking 
        </dt>
        <dd>
                <i class="fa fa-check text-green" aria-hidden="true"></i>
        </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Outdoor parking
        </dt>
        <dd>
                <i class="fa fa-check text-green" aria-hidden="true"></i>
        </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Lift
        </dt>
        <dd>
                <i class="fa fa-check text-green" aria-hidden="true"></i>
        </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Cable TV
        </dt>
        <dd>
                <i class="fa fa-check text-green" aria-hidden="true"></i>
        </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Public transport stop
        </dt>
        <dd>
150 m            </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Motorway
        </dt>
        <dd>
500 m            </dd>
    </div>
    <div class="column">
        <dt class="label-text">
            Shops
        </dt>
        <dd>
300 m            </dd>
    </div>
</dl>

1 answer:

Answer 0 (score: 1)

Take the HTML you provided and assume it is stored as a string in table_text.

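from bs4 import BeautifulSoup

# table_text holds the HTML from the question as a string
soup = BeautifulSoup(table_text, "lxml")
temp_dict = {}
# each <div class="column"> wraps one <dt> label and its <dd> value
for d in soup.find_all("div", {"class": "column"}):
    temp_dict[d.find("dt").text.strip()] = d.find("dd").text.strip()
print(temp_dict)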
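Note that with the loop above, rows such as Lift or Cable TV, whose <dd> contains only a check icon and no text, end up with an empty string as their value. If that matters, here is a minimal sketch of one way to handle it. It assumes a check icon means "yes" and stores True for it; the function name parse_attributes and the shortened sample_html are made up for illustration, and it uses the built-in "html.parser" instead of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumption: the real page has the same shape).
sample_html = """
<dl class="row attributes-grid">
    <div class="column">
        <dt class="label-text">Property type</dt>
        <dd>Commercial property</dd>
    </div>
    <div class="column">
        <dt class="label-text">Lift</dt>
        <dd><i class="fa fa-check text-green" aria-hidden="true"></i></dd>
    </div>
</dl>
"""


def parse_attributes(html):
    """Return {label: value} for every <div class="column"> in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    attributes = {}
    for column in soup.find_all("div", class_="column"):
        label = column.find("dt")
        value = column.find("dd")
        if label is None or value is None:
            continue  # skip malformed columns instead of raising AttributeError
        text = value.get_text(strip=True)
        # Assumption: an icon-only cell with a "fa-check" icon means the feature is present.
        if not text and value.find("i", class_="fa-check") is not None:
            attributes[label.get_text(strip=True)] = True
        else:
            attributes[label.get_text(strip=True)] = text
    return attributes


result = parse_attributes(sample_html)
print(result)
```

Whether True, "yes", or something else is the right value for a check icon depends on what you plan to do with the dictionary afterwards.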

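I assume the HTML you provided covers only one row of the table (a single property). If you want all the rows, loop over them and keep a parent dictionary: on each iteration, add the row's identifier as the key and temp_dict as the value. That gives you the structure you want.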
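The parent-dictionary idea could look roughly like the sketch below. The question does not show where the key for each property comes from, so the listing_pages mapping and its "listing-1" / "listing-2" keys are placeholders, as are the tiny HTML snippets; the helper parse_listing is likewise a made-up name that simply wraps the same loop.

```python
from bs4 import BeautifulSoup


def parse_listing(html):
    """Build the per-property dictionary from one <dl> block."""
    soup = BeautifulSoup(html, "html.parser")
    temp_dict = {}
    for d in soup.find_all("div", {"class": "column"}):
        dt, dd = d.find("dt"), d.find("dd")
        if dt is not None and dd is not None:
            temp_dict[dt.text.strip()] = dd.text.strip()
    return temp_dict


# Hypothetical input: one HTML string per property, keyed by an identifier of your choice.
listing_pages = {
    "listing-1": '<dl><div class="column"><dt>Purchase price</dt><dd>CHF 475,000</dd></div></dl>',
    "listing-2": '<dl><div class="column"><dt>Floor space</dt><dd>114 m²</dd></div></dl>',
}

# Parent dictionary: identifier -> attribute dictionary for that property.
all_listings = {}
for key, html in listing_pages.items():
    all_listings[key] = parse_listing(html)

print(all_listings)
```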