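Web scraping reviews from the past year only

Time: 2018-11-27 06:22:53

Tags: python web-scraping beautifulsoup

I am trying to scrape reviews for one specific airline, SpiceJet, from TripAdvisor, limited to the past year. Link: https://www.tripadvisor.com/Airline_Review-d8728949-Reviews-or60-SpiceJet#REVIEWS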

However, the review date is stored inconsistently. Some reviews have it in the text of the span: <span class="ratingDate"> Reviewed October 22, 2018 </span>

Others have it in the title attribute:

<span class="ratingDate relativeDate" title="October 23, 2018"> Reviewed 5 weeks ago </span>

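I want to extract the date and add a condition so that only reviews up to one year old are collected. I am having trouble handling the two date formats, so how should I compare them?

Code: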

date = items.find(class_="ratingDate").get("title")
date = dt.strptime(date, "%B %d, %Y")
if (date > dt.strptime(('November 26 2017'),"%B %d %Y")):
    date = items.find('span', class_='ratingDate')['title']

Output:

“Manageable”

('October 23,2018',)

<ipython-input-72-3d5de04a2794> in get_info()
  6         for items in soup.find_all(class_="innerBubble"):
  7             date = items.find(class_="ratingDate").get("title")
  ----> 8             date = dt.strptime(date, "%B %d, %Y")
  9             if (date > dt.strptime(('November 26 2017'),"%B %d %Y")):
 10                 print("===========================================")

 TypeError: strptime() argument 1 must be str, not None

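3 answers:

Answer 0 (score: 1)

You can do a lot of work, or you can trace where the data comes from and poke at the source until you find something you like better. Here it looks like the data is loaded from:
https://www.tripadvisor.com/AirlineTips
As you noted, it is ugly.

The exact call it made for me was:
https://www.tripadvisor.com/AirlineTips?d=8728949&inline=true

Which spits out: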

<div class="page page1">
<div class="tip">
<div class="memberOverlayLink" id="UID_-SRC_635739734" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorWidth="30">
<div class="circularAvWrap smallCircularAvWrap profile_UID_-SRC_635739734">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/85/avatar006.jpg" class="avatar" width="28" height="28"/>
</div>
</div> <div class="tipText">
<blockquote>&#x201c;Value for Money&#x201d;</blockquote>
<span class="ui_bubble_rating bubble_4" alt="4.0 of 5 bubbles"></span>
Santhoshpp, 2 days ago
<span class="pipe">|</span> <a href="/ShowUserReviews-g1-d8728949-r635739734-SpiceJet-World.html" onclick="ta.trackEventOnPage('Tab Content', 'read_review', 'Read Review');">Read review</a> </div> </div>
<div class="tip">
<div class="memberOverlayLink" id="UID_-SRC_635711432" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorWidth="30">
<div class="circularAvWrap smallCircularAvWrap profile_UID_-SRC_635711432">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/99/avatar025.jpg" class="avatar" width="28" height="28"/>
</div>
</div> <div class="tipText">

Ugh, horrible.

Let's try changing that request to inline=false ... https://www.tripadvisor.com/AirlineTips?d=8728949&inline=false
which gives us

<script> new Asset.css('https://static.tacdn.com/css2/accommodations/room_tips_overlay-v22801712797b.css');</script>
<div id="TIPSOVERLAY" class="wrap">
<div class="title">
<span class="fl">
See travel tips for airlines </span>
</div>
<div class="content">
<div class="tip"><span class="tipBody">&#x201c;Value for Money&#x201d; (Santhoshpp) </span>
<div class="rsImg">
<span class="ui_bubble_rating bubble_4"></span>
<span class="dateAuthor">Nov 25, 2018</span>
</div>
</div>
<div class="tip"><span class="tipBody">&#x201c;carry your own entertainment stuff and be ready if your flight gets delayed&#x201d; (vbroams) </span>
<div class="rsImg">
<span class="ui_bubble_rating bubble_3"></span>
<span class="dateAuthor">Nov 25, 2018</span>
</div>
</div>

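Oh, sexy barnacles, Batman! There it is. Now we don't have to fight with dates in Python or anything else.

tl;dr
Don't scrape the soup. Underneath any dynamic content there is an API.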
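A minimal sketch of how the inline=false markup shown above could be parsed. It assumes the response keeps the tip, tipBody and dateAuthor class names and the "Nov 25, 2018" date format seen in the sample. The HTML is embedded rather than fetched, and the helper name recent_tips, the second (older) tip and the fixed reference date are illustrative assumptions, not part of any TripAdvisor API.

```python
from datetime import datetime, timedelta

from bs4 import BeautifulSoup

# Trimmed from the inline=false response shown above, plus one made-up
# older tip (hypothetical) so the filter has something to drop.
SAMPLE_HTML = """
<div class="content">
<div class="tip"><span class="tipBody">&#x201c;Value for Money&#x201d; (Santhoshpp) </span>
<div class="rsImg">
<span class="ui_bubble_rating bubble_4"></span>
<span class="dateAuthor">Nov 25, 2018</span>
</div>
</div>
<div class="tip"><span class="tipBody">&#x201c;Old tip&#x201d; (someone) </span>
<div class="rsImg">
<span class="ui_bubble_rating bubble_3"></span>
<span class="dateAuthor">Jan 5, 2017</span>
</div>
</div>
</div>
"""


def recent_tips(html, today, max_age_days=365):
    """Return (date, text) pairs for tips no older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for tip in soup.select("div.tip"):
        body = tip.select_one(".tipBody")
        date_tag = tip.select_one(".dateAuthor")
        if body is None or date_tag is None:
            continue  # skip tips that lack either field
        # "%b %d, %Y" matches abbreviated month names such as "Nov 25, 2018"
        tip_date = datetime.strptime(
            date_tag.get_text(strip=True), "%b %d, %Y"
        ).date()
        if tip_date >= cutoff:
            kept.append((tip_date, body.get_text(strip=True)))
    return kept


tips = recent_tips(SAMPLE_HTML, today=datetime(2018, 11, 27).date())
for tip_date, text in tips:
    print(tip_date, text)
```

The dates in this markup are absolute, so a single strptime format covers every tip. In real use the reference date would be date.today() and the HTML would come from the request above.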

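Answer 1 (score: 0)

As I understand it, you don't have to compare the two date values, since both represent the same date. So, for each review, check whether a span-text date or a title date exists. If both exist, check only one. The check can be done with strptime.

For the title date (assumed below to be relative, e.g. "5 weeks ago"), you will need timedelta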

import datetime

span_date = None
title_date = None
# Cutoff: one year (365 days) before today
one_year_ago_date = datetime.date.today() - datetime.timedelta(days=365)

# ADD CODE HERE to get date strings for span_date and title_date

# Assume span_date = "October 22, 2018"
review_date = None
if span_date is not None:
    review_date = datetime.datetime.strptime(span_date, "%B %d, %Y").date()

# Assume title_date = "5 weeks ago"
elif title_date is not None:
    amount, unit = title_date.split()[:2]
    # timedelta keywords are plural ("days", "weeks"); months and years are not supported
    if not unit.endswith("s"):
        unit += "s"
    delta = datetime.timedelta(**{unit: float(amount)})
    review_date = datetime.date.today() - delta

# Keep the review only if it is at most one year old
if review_date is not None and review_date >= one_year_ago_date:
    print("Save this review")
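For the two span variants quoted in the question, one possible sketch is to prefer the title attribute when it is present and otherwise strip the "Reviewed" prefix from the span text, which avoids the relative-date arithmetic altogether. The function names and the fixed today value are illustrative assumptions, and the third sample row is made up.

```python
from datetime import date, datetime, timedelta


def parse_review_date(span_text, title=None):
    """Return the review date from a ratingDate span's text and title.

    title is the span's title attribute, or None when it is absent.
    """
    # Relative spans carry the absolute date in title; plain spans carry
    # it in the text after the "Reviewed" prefix.
    raw = title if title else span_text.replace("Reviewed", "", 1)
    return datetime.strptime(raw.strip(), "%B %d, %Y").date()


def is_within_last_year(review_date, today):
    """True when review_date is no more than 365 days before today."""
    return today - timedelta(days=365) <= review_date <= today


today = date(2018, 11, 27)  # fixed so the example is reproducible
samples = [
    (" Reviewed October 22, 2018 ", None),
    (" Reviewed 5 weeks ago ", "October 23, 2018"),
    (" Reviewed March 3, 2017 ", None),  # hypothetical older review
]
for text, title in samples:
    parsed = parse_review_date(text, title)
    print(parsed, is_within_last_year(parsed, today))
```

With BeautifulSoup, the two arguments would come from tag.get_text() and tag.get("title"); get returns None when the attribute is missing, which is the case that triggers the TypeError in the question.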

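Answer 2 (score: 0)

You can use the class selector .ratingDate, taking advantage of how CSS matches on classes, to pull back all the review dates. It matches both .ratingDate and .ratingDate.relativeDate. You will find that where the matched element's class list has a len of 2, the date is in the element's title attribute, i.e. elements with the class ratingDate relativeDate.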

<span class="ratingDate relativeDate" title="October 26, 2018">Reviewed 4 weeks ago
</span>
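To see this class matching without hitting the site, here is a small offline sketch using the two span variants from the question. It relies on BeautifulSoup exposing the class attribute as a list, which is its default behaviour for HTML.

```python
from bs4 import BeautifulSoup

html = """
<span class="ratingDate"> Reviewed October 22, 2018 </span>
<span class="ratingDate relativeDate" title="October 23, 2018"> Reviewed 5 weeks ago </span>
"""
soup = BeautifulSoup(html, "html.parser")

# .ratingDate matches every element whose class list contains ratingDate,
# so both spans are returned.
matches = soup.select(".ratingDate")
for tag in matches:
    # tag["class"] is a list; two entries means the relativeDate variant,
    # whose absolute date lives in the title attribute.
    print(tag["class"], tag.get("title"))

class_counts = [len(tag["class"]) for tag in matches]
titles = [tag.get("title") for tag in matches]
```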

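You can also grab the review text with a class selector. Zip the two together and convert to a list.

Below is an outline without the date filtering. Either filter out dates older than your cutoff first (but then you will need an index linking the lists, so that dates and review text still match up) or start from here. The dates all end up in one consistent format.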

import requests
from bs4 import BeautifulSoup
url = 'https://www.tripadvisor.com/Airline_Review-d8728949-Reviews-or60-SpiceJet#REVIEWS'
data = requests.get(url).content
soup = BeautifulSoup(data,'lxml')
dateStrings = soup.select('.ratingDate')  
reviewStrings = soup.select('.partial_entry')
reviewDates = [date['title'].strip() if len(date['class']) == 2 else date.text.strip().replace('Reviewed ','') for date in dateStrings]
reviews = [review.text.strip() for review in reviewStrings]
allInfo = list(zip(reviewDates,reviews))
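The date filter left out of the outline could look roughly like this. It assumes a list of (date string, review text) pairs in the "Month D, YYYY" format produced above. The sample rows, the fixed reference date and the name filter_last_year are made up for illustration.

```python
from datetime import date, datetime, timedelta


def filter_last_year(pairs, today):
    """Keep (date_string, review) pairs dated within 365 days of today."""
    cutoff = today - timedelta(days=365)
    kept = []
    for date_string, review in pairs:
        review_date = datetime.strptime(date_string, "%B %d, %Y").date()
        if cutoff <= review_date <= today:
            kept.append((review_date, review))
    return kept


# Hypothetical sample standing in for allInfo
sample_info = [
    ("October 26, 2018", "Flight left on time."),
    ("November 30, 2017", "Decent service."),
    ("June 2, 2016", "Long delay."),
]
recent = filter_last_year(sample_info, today=date(2018, 11, 27))
for review_date, review in recent:
    print(review_date, review)
```

Filtering after zipping keeps each date attached to its review, so no separate index is needed. Note that zip pairs items purely by position, so this only holds if both selectors return the same number of elements in the same order.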