我想从两个html页面中提取数据。当我从一个页面中取出数据并向另一个页面移动一些元素更改时,数据会出现在列表和列表更改中。
我的以下问题的代码
details_containers = soup_page.findAll("div",{"id":"RESTAURANT_DETAILS"})
details_container = details_containers[0].findAll("div",{"class":"content"})
cuisine = details_container[0].text.strip()
print(cuisine)
meals = details_container[1].text.strip()
print(meals)
hotel_features = details_container[2].text.strip()
print(hotel_features)
从第一个html我想要美食,美食,retaurant_features内容价值。但是还有一些额外的小时值,平均价格。
<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<a href="/UpdateListing-g297595-d6384395-Ocellus-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
<div class="improve_listing_btn ui_button primary">Improve this listing</div>
</a>
<h3 class="tabs_header">Restaurant Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating summary
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Food</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Value</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<div class="row">
<div class="title">
Average prices
</div>
<div class="content">
<span>₹ 448 -
₹ 768</span>
</div>
</div>
<div class="row">
<div class="title">
Cuisine
</div>
<div class="content">
<a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-c26-Raipur_Raipur_District_Chhattisgarh.html">Italian</a>, <a href="/Restaurants-g297595-c20-Raipur_Raipur_District_Chhattisgarh.html">French</a>, <a href="/Restaurants-g297595-c11-Raipur_Raipur_District_Chhattisgarh.html">Chinese</a>, <a href="/Restaurants-g297595-c22-Raipur_Raipur_District_Chhattisgarh.html">International</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a>
</div>
</div>
<div class="row">
<div class="title">
Meals
</div>
<div class="content">
Breakfast, Lunch, Dinner, Brunch
</div>
</div>
<div class="row">
<div class="title">
Restaurant features
</div>
<div class="content">
Reservations, Seating, Takeout, Private Dining, Waitstaff
</div>
</div>
<div class="row">
<div class="title">
Good for
</div>
<div class="content">
Groups, Business meetings, Child-friendly
</div>
</div>
<div class="row">
<div class="hours title">
Open Hours
</div>
<div class="hours content">
<div class="detail">
<span class="day">Sunday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Monday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Tuesday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Wednesday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Thursday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Friday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
<div class="detail">
<span class="day">Saturday</span>
<span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
</div>
</div>
</div>
</div>
<div class="additional_info">
<div class="title">
Location and Contact Information </div>
<div class="content">
<ul class="detailsContent">
<li>
<div class="detail">Address:
<span> <span class="format_address"><span class="street-address">G.E. Road</span> | <span class="extended-address">Mayura Hotel</span>, <span class="locality">Raipur 492001, </span><span class="country-name">India</span> </span>
</span>
</div>
</li>
<li>
<div class="detail">Location:
<span> Asia</span>
<span> > India</span>
<span> > Chhattisgarh</span>
<span> > Raipur District</span>
<span> > Raipur</span>
</div>
</li>
<li>
<div class="detail">Phone Number:
<span>+91 77142 00500</span>
</div>
</li>
<li>
<span class="ui_icon email"></span>
<a target="_blank"" href="mailto:banquet@themayurahotels.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','6384395')">
E-mail </a>
</li>
<!--trkP:waypoint_for_poi_2-->
<!-- PLACEMENT waypoint_for_poi -->
<div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
</div>
<!--etk-->
</ul>
</div>
</div>
<!--[if lte IE 9]>
<style>
.details_block .threeColumnList{
height: 350px;
overflow: auto;
}
</style>
<![endif]-->
</div>
</div>
从第二个html我想要美食,餐饮,retaurant_features内容值,如上面的HTML。 但是在这个额外的小时内容值中,平均价格不存在
<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<a href="/UpdateListing-g297595-d8595502-Barbeque_Nation-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
<div class="improve_listing_btn ui_button primary">Improve this listing</div>
</a>
<h3 class="tabs_header">Restaurant Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating summary
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Food</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Value</span>
</div>
<div class="wrap row part ">
<span class="ui_bubble_rating bubble_40" alt="4.0 of 5 bubbles"></span>
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<div class="row">
<div class="title">
Cuisine
</div>
<div class="content">
<a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c6-Raipur_Raipur_District_Chhattisgarh.html">Barbecue</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a>, <a href="/Restaurants-g297595-zfz10697-Raipur_Raipur_District_Chhattisgarh.html">Vegan Options</a>, <a href="/Restaurants-g297595-zfz10992-Raipur_Raipur_District_Chhattisgarh.html">Gluten Free Options</a>
</div>
</div>
<div class="row">
<div class="title">
Meals
</div>
<div class="content">
Lunch, Dinner
</div>
</div>
<div class="row">
<div class="title">
Restaurant features
</div>
<div class="content">
Reservations, Seating, Waitstaff, Wheelchair Accessible, Validated Parking
</div>
</div>
<div class="row">
<div class="title">
Good for
</div>
<div class="content">
Groups, Special Occasion Dining, Kids, Child-friendly
</div>
</div>
</div>
<div class="additional_info">
<div class="title">
Location and Contact Information </div>
<div class="content">
<ul class="detailsContent">
<li>
<div class="detail">Address:
<span> <span class="format_address"> | <span class="extended-address">Magneto The Mall, 2nd Floor</span>, <span class="locality">Raipur 429010, </span><span class="country-name">India</span> </span>
</span>
</div>
</li>
<li>
<div class="detail">Location:
<span> Asia</span>
<span> > India</span>
<span> > Chhattisgarh</span>
<span> > Raipur District</span>
<span> > Raipur</span>
</div>
</li>
<li>
<div class="detail">Phone Number:
<span>+91 77160 60008</span>
</div>
</li>
<li>
<span class="ui_icon email"></span>
<a target="_blank"" href="mailto:feedback@barbeque-nation.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','8595502')">
E-mail </a>
</li>
<!--trkP:waypoint_for_poi_2-->
<!-- PLACEMENT waypoint_for_poi -->
<div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
</div>
<!--etk-->
</ul>
</div>
</div>
<!--[if lte IE 9]>
<style>
.details_block .threeColumnList{
height: 350px;
overflow: auto;
}
</style>
<![endif]-->
</div>
</div>
答案 0 :(得分:0)
您可以找到包含标题的所有<div class="row">
,而不是获取所有rows = details_container.findAll('div', {'class': 'row'})
# used to store data extracted from HTML <div class="row"> elements
data = {}
for row in rows:
title = row.find('div', {'class': 'title'})
content = row.find('div', {'class': 'content'})
if title and content:
# here I am just formatting the dict key to be more python-ish. totally optional
title = title.text.strip().lower().replace(' ', '-')
data[title] = content
# tested with the HTML from the first page
print data.keys()
#=> [u'cuisine', u'restaurant-features', u'average-prices', u'good-for', u'open-hours', u'meals']
print type(data['cuisine'])
#=> <class 'bs4.element.Tag'>
块的列表并通过其索引选择多个(而是从第一页更改为第二页)。各自的内容。
test
现在,您可以从HTML网页中提取内容项,而无需关心它们出现的顺序。此代码适用于任何具有相同通用结构的HTML你提供的两个页面。我希望这有帮助!