I am trying to retrieve the review dates from a product page: http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371
However, the dates are hidden in meta tags; see the first line:
<meta itemprop="datePublished" content="2014-11-27" />
</div><div id='80886327' itemprop="review" itemscope itemtype="http://schema.org/Review"><meta itemprop="itemReviewed" content="HUSKY 41 in. 16-Drawer Tool Chest and Cabinet Set" /><span itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">Rated <span itemprop="ratingValue">5</span> out of <span itemprop="bestRating">5</span></span> by <span itemprop="author">Razor</span><span itemprop="name"> solid construction
</span><span itemprop="description"> I spent the last month checking and looking at all tool boxes that I could find. Online and at available stores. In comparison to all, this is by far the best deal for the money. Quality, workmanship and construction of this is by far the best for the money. Some I looked at are twice as much money for the same quality... I have had this approx. a month and filled with tools and shop stuff and with the ball bearing drawers loaded, does not make any difference on drawer operation. Granted we still need the test of time..
Does anyone know how to save these dates into a list?
Answer 0 (score: 3)
You can use find_all() to get every meta tag that has itemprop="datePublished", then read each tag's content attribute.
Code:
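# Python 2 (urllib2); on Python 3, use urllib.request.urlopen instead
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371'
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
# Collect the content attribute of every <meta itemprop="datePublished"> tag
print([meta.get('content') for meta in soup.find_all('meta', itemprop='datePublished')])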
Prints:
[
'2014-11-27',
'2014-11-20',
'2014-12-15',
'2014-10-28',
'2014-10-10'
]
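The values collected this way are plain strings. If real date objects are more useful (for sorting or comparing), a minimal sketch of the conversion might look like the following. It assumes every value follows the YYYY-MM-DD pattern seen in the output above; the helper name parse_review_dates is hypothetical and not part of BeautifulSoup.

```python
from datetime import datetime

# Sample values copied from the output shown above
raw_dates = ['2014-11-27', '2014-11-20', '2014-12-15', '2014-10-28', '2014-10-10']


def parse_review_dates(values):
    """Convert 'YYYY-MM-DD' strings into datetime.date objects (hypothetical helper)."""
    return [datetime.strptime(value, '%Y-%m-%d').date() for value in values]


review_dates = parse_review_dates(raw_dates)
oldest = min(review_dates)
newest = max(review_dates)
print(oldest, newest)
```

strptime raises ValueError on a string that does not match the format, so a malformed content attribute would surface immediately rather than being stored silently.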
Alternatively, use a CSS selector:
print([meta.get('content') for meta in soup.select('meta[itemprop="datePublished"]')])
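To try both approaches without fetching the live page, here is a small offline sketch written for Python 3. The SAMPLE_HTML string is a trimmed, hypothetical stand-in for the review markup quoted in the question, not the real page, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the review markup on the product page
SAMPLE_HTML = """
<div itemprop="review" itemscope itemtype="http://schema.org/Review">
  <meta itemprop="datePublished" content="2014-11-27" />
  <meta itemprop="itemReviewed" content="HUSKY 41 in. 16-Drawer Tool Chest and Cabinet Set" />
</div>
<div itemprop="review" itemscope itemtype="http://schema.org/Review">
  <meta itemprop="datePublished" content="2014-11-20" />
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Keyword-argument filter: matches <meta> tags whose itemprop equals the given value
via_find_all = [m.get('content') for m in soup.find_all('meta', itemprop='datePublished')]

# CSS attribute selector that targets the same tags
via_select = [m.get('content') for m in soup.select('meta[itemprop="datePublished"]')]

print(via_find_all)
print(via_select)
```

Both calls skip the itemReviewed meta tag because its itemprop value differs, so the two lists should hold the same dates in document order.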