How do I extract data from two HTML pages?

Asked: 2017-08-31 19:41:59

Tags: python python-2.7 beautifulsoup

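I want to extract data from two HTML pages. The extracted data comes back as a list, and when I move from one page to the other some elements are added or missing, so the positions in the list change.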

My code for this problem is below:

details_containers = soup_page.findAll("div", {"id": "RESTAURANT_DETAILS"})
details_container = details_containers[0].findAll("div", {"class": "content"})
cuisine = details_container[0].text.strip()
print(cuisine)
meals = details_container[1].text.strip()
print(meals)
hotel_features = details_container[2].text.strip()
print(hotel_features)

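From the first HTML page I want the content values of Cuisine, Meals, and Restaurant features. This page also contains extra values, Open Hours and Average prices.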

<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
    <div class="header_with_improve wrap">
        <a href="/UpdateListing-g297595-d6384395-Ocellus-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
            <div class="improve_listing_btn ui_button primary">Improve this listing</div>
        </a>
        <h3 class="tabs_header">Restaurant Details</h3> </div>
    <div class="details_tab">
        <div class="table_section">
            <div class="row">
                <div class="ratingSummary wrap">
                    <div class="histogramCommon bubbleHistogram wrap">
                        <div class="colTitle">
                            Rating summary
                        </div>
                        <ul class="barChart">
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Food</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Service</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Value</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                        </ul>
                    </div>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Average prices
                </div>
                <div class="content">
                    <span>₹&nbsp;448 -
₹&nbsp;768</span>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Cuisine
                </div>
                <div class="content">
                    <a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-c26-Raipur_Raipur_District_Chhattisgarh.html">Italian</a>, <a href="/Restaurants-g297595-c20-Raipur_Raipur_District_Chhattisgarh.html">French</a>, <a href="/Restaurants-g297595-c11-Raipur_Raipur_District_Chhattisgarh.html">Chinese</a>, <a href="/Restaurants-g297595-c22-Raipur_Raipur_District_Chhattisgarh.html">International</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Meals
                </div>
                <div class="content">
                    Breakfast, Lunch, Dinner, Brunch
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Restaurant features
                </div>
                <div class="content">
                    Reservations, Seating, Takeout, Private Dining, Waitstaff
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Good for
                </div>
                <div class="content">
                    Groups, Business meetings, Child-friendly
                </div>
            </div>
            <div class="row">
                <div class="hours title">
                    Open Hours
                </div>
                <div class="hours content">
                    <div class="detail">
                        <span class="day">Sunday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Monday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Tuesday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Wednesday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Thursday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Friday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Saturday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                </div>
            </div>
        </div>
        <div class="additional_info">
            <div class="title">
                Location and Contact Information </div>
            <div class="content">
                <ul class="detailsContent">
                    <li>
                        <div class="detail">Address:
                            <span> <span class="format_address"><span class="street-address">G.E. Road</span> | <span class="extended-address">Mayura Hotel</span>, <span class="locality">Raipur 492001, </span><span class="country-name">India</span> </span>
                            </span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Location:
                            <span> Asia</span>
                            <span> &nbsp;&gt;&nbsp; India</span>
                            <span> &nbsp;&gt;&nbsp; Chhattisgarh</span>
                            <span> &nbsp;&gt;&nbsp; Raipur District</span>
                            <span> &nbsp;&gt;&nbsp; Raipur</span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Phone Number:
                            <span>+91 77142 00500</span>
                        </div>
                    </li>
                    <li>
                        <span class="ui_icon email"></span>
                        <a target="_blank&quot;" href="mailto:banquet@themayurahotels.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','6384395')">
E-mail </a>
                    </li>
                    <!--trkP:waypoint_for_poi_2-->
                    <!-- PLACEMENT waypoint_for_poi -->
                    <div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
                    </div>
                    <!--etk-->
                </ul>
            </div>
        </div>
        <!--[if lte IE 9]>
            <style>
                .details_block .threeColumnList{
                    height: 350px;
                    overflow: auto;
                }
            </style>
            <![endif]-->
    </div>
</div>

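From the second HTML page I want the same Cuisine, Meals, and Restaurant features content values as in the HTML above. In this page, however, the extra Open Hours and Average prices values are not present.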

<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
    <div class="header_with_improve wrap">
        <a href="/UpdateListing-g297595-d8595502-Barbeque_Nation-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
            <div class="improve_listing_btn ui_button primary">Improve this listing</div>
        </a>
        <h3 class="tabs_header">Restaurant Details</h3> </div>
    <div class="details_tab">
        <div class="table_section">
            <div class="row">
                <div class="ratingSummary wrap">
                    <div class="histogramCommon bubbleHistogram wrap">
                        <div class="colTitle">
                            Rating summary
                        </div>
                        <ul class="barChart">
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Food</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Service</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Value</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_40" alt="4.0 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                        </ul>
                    </div>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Cuisine
                </div>
                <div class="content">
                    <a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c6-Raipur_Raipur_District_Chhattisgarh.html">Barbecue</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a>, <a href="/Restaurants-g297595-zfz10697-Raipur_Raipur_District_Chhattisgarh.html">Vegan Options</a>, <a href="/Restaurants-g297595-zfz10992-Raipur_Raipur_District_Chhattisgarh.html">Gluten Free Options</a>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Meals
                </div>
                <div class="content">
                    Lunch, Dinner
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Restaurant features
                </div>
                <div class="content">
                    Reservations, Seating, Waitstaff, Wheelchair Accessible, Validated Parking
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Good for
                </div>
                <div class="content">
                    Groups, Special Occasion Dining, Kids, Child-friendly
                </div>
            </div>
        </div>
        <div class="additional_info">
            <div class="title">
                Location and Contact Information </div>
            <div class="content">
                <ul class="detailsContent">
                    <li>
                        <div class="detail">Address:
                            <span> <span class="format_address"> | <span class="extended-address">Magneto The Mall, 2nd Floor</span>, <span class="locality">Raipur 429010, </span><span class="country-name">India</span> </span>
                            </span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Location:
                            <span> Asia</span>
                            <span> &nbsp;&gt;&nbsp; India</span>
                            <span> &nbsp;&gt;&nbsp; Chhattisgarh</span>
                            <span> &nbsp;&gt;&nbsp; Raipur District</span>
                            <span> &nbsp;&gt;&nbsp; Raipur</span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Phone Number:
                            <span>+91 77160 60008</span>
                        </div>
                    </li>
                    <li>
                        <span class="ui_icon email"></span>
                        <a target="_blank&quot;" href="mailto:feedback@barbeque-nation.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','8595502')">
    E-mail </a>
                    </li>
                    <!--trkP:waypoint_for_poi_2-->
                    <!-- PLACEMENT waypoint_for_poi -->
                    <div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
                    </div>
                    <!--etk-->
                </ul>
            </div>
        </div>
        <!--[if lte IE 9]>
                <style>
                    .details_block .threeColumnList{
                        height: 350px;
                        overflow: auto;
                    }
                </style>
                <![endif]-->
    </div>
</div>

1 answer:

Answer 0 (score: 0)

Instead of getting a list of all the <div class="content"> blocks and selecting several of them by index (the indexes change from the first page to the second), you can find all the <div class="row"> elements that contain a title and read their respective content:

# details_container must be a single element here (the RESTAURANT_DETAILS div,
# e.g. details_containers[0] from the question), not the list of content blocks
rows = details_container.findAll('div', {'class': 'row'})

# used to store data extracted from HTML <div class="row"> elements
data = {}

for row in rows:
    title = row.find('div', {'class': 'title'})
    content = row.find('div', {'class': 'content'})
    if title and content:
        # here I am just formatting the dict key to be more python-ish. totally optional
        title = title.text.strip().lower().replace(' ', '-')
        data[title] = content

# tested with the HTML from the first page
print data.keys()
#=> [u'cuisine', u'restaurant-features', u'average-prices', u'good-for', u'open-hours', u'meals']
print type(data['cuisine'])
#=> <class 'bs4.element.Tag'>
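The values stored in data are Tag objects, not strings, so you will usually want their text. A minimal sketch of turning the text of one content block into a list, assuming the items are comma-separated as they are in the Meals and Good for rows of the question. The helper name split_values is hypothetical, and in practice its input would come from something like data['meals'].text:

```python
def split_values(text):
    """Split a comma-separated content string into a list of clean items."""
    # Strip surrounding whitespace from each item and drop empty pieces.
    return [item.strip() for item in text.split(",") if item.strip()]


# Sample strings copied from the "Meals" and "Good for" rows in the question.
meals = split_values("\n    Breakfast, Lunch, Dinner, Brunch\n")
good_for = split_values("Groups, Business meetings, Child-friendly")
print(meals)
print(good_for)
```

This works on plain strings, so it does not depend on how the text was extracted.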


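Now you can extract the content items from the HTML pages without caring about the order in which they appear. This code works for any HTML that has the same general structure as the two pages you provided. I hope this helps!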
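To illustrate that the title-based lookup does not depend on row order, here is a self-contained sketch run against two trimmed-down stand-ins for the pages in the question. The sample HTML, the function name extract_details, and the WANTED tuple are assumptions made for illustration; only the row/title/content structure is taken from the original pages. Rows that a page does not have come back as None rather than shifting the other values:

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too

from bs4 import BeautifulSoup

# Hypothetical, shortened samples that mimic the structure of the two pages.
PAGE_ONE = """
<div id="RESTAURANT_DETAILS">
  <div class="row"><div class="title">Average prices</div><div class="content"><span>448 - 768</span></div></div>
  <div class="row"><div class="title">Cuisine</div><div class="content"><a href="#">Indian</a>, <a href="#">Asian</a></div></div>
  <div class="row"><div class="title">Meals</div><div class="content">Breakfast, Lunch</div></div>
  <div class="row"><div class="title">Restaurant features</div><div class="content">Reservations, Seating</div></div>
</div>
"""
PAGE_TWO = """
<div id="RESTAURANT_DETAILS">
  <div class="row"><div class="title">Cuisine</div><div class="content"><a href="#">Indian</a>, <a href="#">Barbecue</a></div></div>
  <div class="row"><div class="title">Meals</div><div class="content">Lunch, Dinner</div></div>
  <div class="row"><div class="title">Restaurant features</div><div class="content">Reservations, Waitstaff</div></div>
</div>
"""

# Keys use the same lower-case, hyphenated format as in the answer above.
WANTED = ("cuisine", "meals", "restaurant-features")


def extract_details(html, wanted=WANTED):
    """Return {key: text} for the wanted rows; rows not on the page map to None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "RESTAURANT_DETAILS"})
    found = {}
    if container is not None:
        for row in container.find_all("div", {"class": "row"}):
            title = row.find("div", {"class": "title"})
            content = row.find("div", {"class": "content"})
            # Skip rows (such as the rating summary) that lack a title or content.
            if title is not None and content is not None:
                key = title.get_text().strip().lower().replace(" ", "-")
                found[key] = content.get_text().strip()
    return {key: found.get(key) for key in wanted}


for page in (PAGE_ONE, PAGE_TWO):
    print(extract_details(page))
```

Looking values up with found.get(key) is the design choice that matters here: a missing row such as Average prices yields None instead of raising an error or pushing other values into the wrong variable.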