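I am trying to scrape one year of reviews for one specific airline, SpiceJet, from Tripadvisor. Link: https://www.tripadvisor.com/Airline_Review-d8728949-Reviews-or60-SpiceJet#REVIEWS
However, the review dates are stored inconsistently. Some reviews keep the date in the text of the span:
<span class="ratingDate">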
Reviewed October 22, 2018
</span>
Others keep it in the title attribute:
<span class="ratingDate relativeDate" title="October 23, 2018">
Reviewed 5 weeks ago
</span>
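I want to extract the date and set a condition so that only reviews from the past year are extracted. I am having trouble handling the two date formats, so how should I compare them?
Code: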
date = items.find(class_="ratingDate").get("title")
date = dt.strptime(date, "%B %d, %Y")
if (date > dt.strptime(('November 26 2017'),"%B %d %Y")):
date = items.find('span', class_='ratingDate')['title']
Output:
“Manageable”
<ipython-input-72-3d5de04a2794> in get_info()
6 for items in soup.find_all(class_="innerBubble"):
7 date = items.find(class_="ratingDate").get("title")
----> 8 date = dt.strptime(date, "%B %d, %Y")
9 if (date > dt.strptime(('November 26 2017'),"%B %d %Y")):
10 print("===========================================")
TypeError: strptime() argument 1 must be str, not None
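Answer 0 (score: 1)
You can do a lot of work, or you can trace where the data comes from and poke at the source until you find something you like better. Here it looks like the data is loaded from: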
https://www.tripadvisor.com/AirlineTips
Which, as you pointed out, is ugly.
The exact call it makes for me is:
https://www.tripadvisor.com/AirlineTips?d=8728949&inline=true
Which spits out:
<div class="page page1">
<div class="tip">
<div class="memberOverlayLink" id="UID_-SRC_635739734" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorWidth="30">
<div class="circularAvWrap smallCircularAvWrap profile_UID_-SRC_635739734">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/85/avatar006.jpg" class="avatar" width="28" height="28"/>
</div>
</div> <div class="tipText">
<blockquote>“Value for Money”</blockquote>
<span class="ui_bubble_rating bubble_4" alt="4.0 of 5 bubbles"></span>
Santhoshpp, 2 days ago
<span class="pipe">|</span> <a href="/ShowUserReviews-g1-d8728949-r635739734-SpiceJet-World.html" onclick="ta.trackEventOnPage('Tab Content', 'read_review', 'Read Review');">Read review</a> </div> </div>
<div class="tip">
<div class="memberOverlayLink" id="UID_-SRC_635711432" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorWidth="30">
<div class="circularAvWrap smallCircularAvWrap profile_UID_-SRC_635711432">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/99/avatar025.jpg" class="avatar" width="28" height="28"/>
</div>
</div> <div class="tipText">
Hmm, gross.
Let's try changing that request to inline=false
...
https://www.tripadvisor.com/AirlineTips?d=8728949&inline=false
gives us
<script> new Asset.css('https://static.tacdn.com/css2/accommodations/room_tips_overlay-v22801712797b.css');</script>
<div id="TIPSOVERLAY" class="wrap">
<div class="title">
<span class="fl">
See travel tips for airlines </span>
</div>
<div class="content">
<div class="tip"><span class="tipBody">“Value for Money” (Santhoshpp) </span>
<div class="rsImg">
<span class="ui_bubble_rating bubble_4"></span>
<span class="dateAuthor">Nov 25, 2018</span>
</div>
</div>
<div class="tip"><span class="tipBody">“carry your own entertainment stuff and be ready if your flight gets delayed” (vbroams) </span>
<div class="rsImg">
<span class="ui_bubble_rating bubble_3"></span>
<span class="dateAuthor">Nov 25, 2018</span>
</div>
</div>
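Oh, sexy barnacles, Batman! There it is. Now we don't have to fight with dates in Python or anything else.
tl;dr
Don't scrape the soup. Underneath any dynamic content there is an API.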
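The dates in the inline=false response all use an abbreviated month ("Nov 25, 2018"), so a single strptime format should cover them. Here is a minimal sketch, assuming the markup stays as shown above. The names TipParser and recent_tips are illustrative, the sample string is copied from the response above, and the standard library's html.parser is used only to keep the example free of dependencies. Note that %b matches English month names only under an English or C locale.

```python
from datetime import datetime
from html.parser import HTMLParser

# Sample copied from the inline=false response shown above
SAMPLE = """
<div class="tip"><span class="tipBody">“Value for Money” (Santhoshpp) </span>
<div class="rsImg">
<span class="ui_bubble_rating bubble_4"></span>
<span class="dateAuthor">Nov 25, 2018</span>
</div>
</div>
"""


class TipParser(HTMLParser):
    """Collects (tip text, date) pairs from tipBody / dateAuthor spans."""

    def __init__(self):
        super().__init__()
        self.tips = []       # finished (text, datetime) pairs
        self._field = None   # which span we are currently inside, if any
        self._text = None    # most recent tipBody text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "tipBody" in classes:
            self._field = "tipBody"
        elif tag == "span" and "dateAuthor" in classes:
            self._field = "dateAuthor"

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field == "tipBody":
            self._text = data.strip()
        elif self._field == "dateAuthor":
            date = datetime.strptime(data.strip(), "%b %d, %Y")
            self.tips.append((self._text, date))


def recent_tips(html, cutoff):
    """Return the (text, date) pairs dated on or after `cutoff`."""
    parser = TipParser()
    parser.feed(html)
    parser.close()
    return [(text, date) for text, date in parser.tips if date >= cutoff]


cutoff = datetime(2017, 11, 26)  # the cutoff used in the question
print(recent_tips(SAMPLE, cutoff))
```

With BeautifulSoup the same idea would be a select on the tipBody and dateAuthor spans. The point is only that every date here parses with one format.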
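Answer 1 (score: 0)
As I understand it, you don't have to compare the two date values with each other, since both represent the same date. So, for each review, check whether the span-class date or the title date exists. If both exist, check only one. The check can be done with strptime.
For the relative form ("5 weeks ago", called title_date below), you will need timedelta.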
import datetime

span_date = None
title_date = None
today = datetime.date.today()
# Approximate "one year ago" as 365 days, which avoids the February 29 edge case
one_year_ago_date = today - datetime.timedelta(days=365)

# ADD CODE HERE to get date strings for span_date and title_date

review_date = None
# Assume span_date = "October 22, 2018"
if span_date is not None:
    review_date = datetime.datetime.strptime(span_date, "%B %d, %Y").date()
# Assume title_date = "5 weeks ago"
elif title_date is not None:
    amount, unit = title_date.split()[:2]
    # timedelta takes plural keywords such as days or weeks; it has no months or years
    if not unit.endswith("s"):
        unit += "s"
    delta = datetime.timedelta(**{unit: float(amount)})
    review_date = today - delta

# Keep the review only if it is no more than one year old
if review_date is not None and review_date >= one_year_ago_date:
    print("Save this review")
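Since timedelta has no notion of months or years, relative text in those units needs an approximation. The sketch below is one way to do that. The helper name relative_to_date, the DAYS_PER_UNIT table, and the 30-day month and 365-day year are all assumptions made for illustration, and only the "<n> <unit> ago" wording seen in the question is handled. Where the span carries a title attribute, the absolute date there is more precise and should be preferred.

```python
import re
from datetime import date, timedelta

# Approximate day counts per unit; month and year lengths are rough assumptions
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def relative_to_date(text, today):
    """Turn text such as 'Reviewed 5 weeks ago' into an approximate date.

    Returns None when the text does not match the '<n> <unit> ago' pattern.
    """
    match = re.search(r"(\d+)\s+(day|week|month|year)s?\s+ago", text)
    if match is None:
        return None
    amount = int(match.group(1))
    return today - timedelta(days=amount * DAYS_PER_UNIT[match.group(2)])


today = date(2018, 11, 26)
print(relative_to_date("Reviewed 5 weeks ago", today))  # 2018-10-22
```

Passing today in as a parameter, rather than calling date.today() inside the function, keeps the result reproducible.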
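Answer 2 (score: 0)
You can use the class selector .ratingDate to pull back all of the review dates, taking advantage of how CSS matches on classes. It will match both .ratingDate and .ratingDate.relativeDate. You will find that where a matched element's class list has a len of 2, the date is in the element's title attribute, i.e. elements with the class ratingDate relativeDate.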
<span class="ratingDate relativeDate" title="October 26, 2018">Reviewed 4 weeks ago
</span>
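You can also grab the review text with a class selector. Zip the two together and convert the result to a list.
Below is an outline without the date filtering. Either filter the dates before zipping (but then you would need an index to link the lists so that dates and review texts still match up) or filter the zipped list from here on. The dates all end up in one consistent format.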
import requests
from bs4 import BeautifulSoup
url = 'https://www.tripadvisor.com/Airline_Review-d8728949-Reviews-or60-SpiceJet#REVIEWS'
data = requests.get(url).content
soup = BeautifulSoup(data,'lxml')
dateStrings = soup.select('.ratingDate')
reviewStrings = soup.select('.partial_entry')
reviewDates = [date['title'].strip() if len(date['class']) == 2 else date.text.strip().replace('Reviewed ','') for date in dateStrings]
reviews = [review.text.strip() for review in reviewStrings]
allInfo = list(zip(reviewDates,reviews))
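Filtering the zipped list afterwards could look like the sketch below. It assumes every date string has the "October 22, 2018" form produced by the outline above. The helper name within_last_year and the sample pairs are hypothetical, and a year is approximated as 365 days.

```python
from datetime import datetime, timedelta


def within_last_year(pairs, now):
    """Keep (date_string, review) pairs dated within 365 days before `now`."""
    cutoff = now - timedelta(days=365)
    kept = []
    for date_string, review in pairs:
        review_date = datetime.strptime(date_string, "%B %d, %Y")
        if review_date >= cutoff:
            kept.append((date_string, review))
    return kept


# Hypothetical sample in the shape produced by list(zip(dates, reviews))
sample = [
    ("October 22, 2018", "first review text"),
    ("November 2, 2017", "second review text"),
]
print(within_last_year(sample, datetime(2018, 11, 26)))
```

Because the date and the text travel together as a pair, no separate index is needed to keep them aligned.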