Web scraping with a for loop: unable to "pass" over certain data

Date: 2019-08-16 09:34:27

Tags: python-3.x web-scraping beautifulsoup

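The code below is supposed to scrape each rating together with the date that rating was posted.

The problem is that staff reply to negative reviews, and the dates of those replies get scraped as well. So when I scrape the site, the numbers of ratings and dates do not match (20 ratings vs. 24 dates), because four of the dates belong to staff replies.

In the code, I try to "pass" over the staff replies every time the class "ugc-brand-response" appears. So if the ugc class is met, "pass"; if not, continue. But no data is stored at all, not even the first few reviews.

I have learned a lot from reading other people's questions and answers. Thanks to this great community.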

import requests
import time
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
url = "https://www.bestbuy.com/site/reviews/jabra-elite-85h-wireless-noise-canceling-over-the-ear-headphones-black/6335100?variant=A"

url_get = requests.get(url, headers=headers)
print(url_get.status_code)
soup = BeautifulSoup(url_get.content, 'lxml')


rating_n_date=[] 

for rating in soup.find_all(attrs={"class": "c-review-average"}):     
    rating_n_date.append(rating .text)
for date in soup.findAll(attrs={"class":"submission-date"}):
    if "class" == "ugc-brand-response" in date:
        pass    
    else:
        continue
    rating_n_date.append(date.text)
time.sleep(2)
print(rating_n_date)

This is the HTML for the data I want to include:

<li class="review-item" tabindex="-1"><div class="row"><div class="hidden-xs hidden-sm col-md-3"><div class="undefined ugc-author v-fw-medium body-copy-lg">Jimmy</div><ul class=" ugc-badge-list"><li class="visible-xs-inline-block visible-sm-inline-block visible-md-block visible-lg-block"><span class="c-overlay-wrapper"><span class="overlayTrigger"><button aria-expanded="false" aria-controls="ugc-badge-overlay-bf28b82b-76f5-3c85-897e-598a91bbd8a8-0" aria-owns="ugc-badge-overlay-bf28b82b-76f5-3c85-897e-598a91bbd8a8-0" data-track="Custom"><div class="ugc-my-bby-badge"><img alt="My Best Buy® Member" src="https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/ugc-badge-mybby-core.svg"></div></button></span><span></span></span></li></ul></div><div class="col-xs-12 col-md-9"><div class="c-ratings-reviews v-medium"><p class="sr-only">Rating: 2 out of 5 stars</p><span class="c-stars c-stars-medium" alt="40%" aria-hidden="true"><span class="unfilled"></span><span class="filled" style="width:40%"></span></span><span class="c-reviews"><span class="c-review-average" aria-hidden="true">2</span></span></div><h3 id="review-id-bf28b82b-76f5-3c85-897e-598a91bbd8a8" class="ugc-review-title c-section-title heading-5 v-fw-medium  ">A disappointment: low volum, weak bass, distorts</h3><div class="disclaimer">Posted <time class="submission-date" title="Apr 28, 2019 11:29 PM">3 months ago</time></div>

This is the HTML I do not want, the staff reply:

<ul class="ugc-brand-response-list"><li><div class="row"><div class="col-sm-12 col-md-9 col-md-offset-3"><div class="ugc-brand-response"><h4 class=" c-section-title body-copy-lg v-fw-medium  ">Brand response</h4><p class="body-copy-lg">Jabra</p><div class="disclaimer"><time class="submission-date" title="Apr 29, 2019 8:46 AM">3 months ago</time></div><div class="ugc-brand-response-body body-copy-lg"><p class="pre-white-space">
Hello Jimmy - We were sorry to learn that the Jabra Elite 85h did not meet your expectations.  As the Elite 85h is a relatively new product, it is very important that you update the firmware in the headphones as often as necessary to keep up-to-date.  We are constantly improving all aspects of the Elite 85h through firmware updates.  If you have any specific questions or concers, we invite you to contact us directly by completing the web form at&nbsp;<a href="https://www.jabra.com/ServiceMenu/contact/ContactJabraSupport/ContactJabraSupportConsumer" target="_blank" rel="nofollow noopener noreferrer" style="word-break: break-all;">https://www.jabra.com/ServiceMenu/contact/ContactJabraSupport/ContactJabraSupportConsumer</a>, or by giving us a call - we love to help!  Thank you.
<img src="https://s3.amazonaws.com/stratos-logos/logos/Jabra.jpg" alt="Jabra" title="Jabra" style="display: block !important; margin-top: 2em !important; border: 1px solid #ccc !important; padding: 2px !important; background-color: white !important;">
<!--[if ReviewResponse]><![endif]--></p></div></div></div></div></li></ul>

1 answer:

Answer 0 (score: 0)

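It will never skip the elements under the class ugc-brand-response-list, because you are explicitly pulling every element that has the class attribute submission-date, wherever it sits in the page.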
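To make this concrete, here is a minimal sketch of why the condition in the question never fires, and of one way to test whether a date belongs to a staff reply. The HTML is a simplified, hypothetical snippet modelled on the markup quoted in the question (the nesting on the real page may differ), and `html.parser` is used only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup modelled on the question's snippets.
html = """
<li class="review-item">
  <div class="disclaimer">Posted
    <time class="submission-date" title="Apr 28, 2019 11:29 PM">3 months ago</time>
  </div>
  <ul class="ugc-brand-response-list"><li>
    <div class="ugc-brand-response">
      <div class="disclaimer">
        <time class="submission-date" title="Apr 29, 2019 8:46 AM">3 months ago</time>
      </div>
    </div>
  </li></ul>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

review_dates = []
for date in soup.find_all("time", class_="submission-date"):
    # The question's test is a chained comparison, evaluated as
    # ("class" == "ugc-brand-response") and ("ugc-brand-response" in date).
    # Two different string literals are never equal, so it is always False.
    always_false = "class" == "ugc-brand-response" in date

    # Instead, look upwards: skip any date nested inside a brand response.
    if date.find_parent(class_="ugc-brand-response") is not None:
        continue
    review_dates.append(date.text)

print(always_false)   # False
print(review_dates)   # ['3 months ago']
```

Only the reviewer's date survives; the staff reply's date is dropped because one of its ancestors carries the class `ugc-brand-response`.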

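You are also misunderstanding continue. Using continue does not mean "carry on with the code", as you might expect. What it really means is: "stop right here, do not run the rest of the loop body, and move on to the next item." So, the way you have it in your code, whenever class == "ugc-brand-response" is not found, execution goes to the else branch, which says continue. The append is therefore never reached, and that is why none of your data is being stored.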
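The difference between pass and continue can be seen without any scraping at all. The sketch below uses two hypothetical helper functions and a made-up list of labels: the first mirrors the control flow of the question's loop, the second skips only the unwanted items.

```python
def collect_with_continue_in_else(items, unwanted):
    """Mirrors the question's loop: `continue` sits in the else branch."""
    kept = []
    for item in items:
        if item == unwanted:
            pass          # does nothing; execution falls through to append
        else:
            continue      # jumps to the next item; append below is skipped
        kept.append(item)
    return kept


def collect_skipping_unwanted(items, unwanted):
    """Skips only the unwanted items and keeps everything else."""
    kept = []
    for item in items:
        if item == unwanted:
            continue      # skip this one item
        kept.append(item)
    return kept


labels = ["review", "brand", "review"]
print(collect_with_continue_in_else(labels, "brand"))  # ['brand']
print(collect_skipping_unwanted(labels, "brand"))      # ['review', 'review']
```

The first version keeps exactly the items it was meant to drop. In the question the condition is never true at all, so every iteration hits continue and the list stays empty.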

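What you can do instead is go up to the parent tag and pull each whole review "block" that has the class attribute "col-xs-12 col-md-9". Then, within each block, get the rating and the submission date together using find (find returns the first occurrence of what you are looking for, which means it will not grab the date of the staff reply) and store them in lists. I then put those into a dataframe/table.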

import requests
import time
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
url = "https://www.bestbuy.com/site/reviews/jabra-elite-85h-wireless-noise-canceling-over-the-ear-headphones-black/6335100?variant=A"

url_get = requests.get(url, headers=headers)
print(url_get.status_code)
soup = BeautifulSoup(url_get.content, 'lxml')


rating_list = [] 
date_list = []

# Loop over each review "block" rather than over every date on the page
for ratings in soup.find_all(attrs={"class": "col-xs-12 col-md-9"}):
    # find() returns only the first match inside the block
    rating = ratings.find('span', {'class':'c-review-average'}).text
    submission_date = ratings.find('time', {'class':'submission-date'}).text

    rating_list.append(rating)
    date_list.append(submission_date)


data = {'Rating':rating_list, 'Date':date_list}
df = pd.DataFrame(data)

Output:

print (df)
   Rating          Date
0       5  3 months ago
1       5  2 months ago
2       4  3 months ago
3       3  3 months ago
4       4  3 months ago
5       4  3 months ago
6       5  3 months ago
7       4  3 months ago
8       5  3 months ago
9       4    1 week ago
10      4  3 months ago
11      2  3 months ago
12      5  3 months ago
13      4     1 day ago
14      4   1 month ago
15      4  3 months ago
16      4   3 weeks ago
17      2  3 months ago
18      3   3 weeks ago
19      5  3 months ago
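
The dates in this table are relative ("3 months ago"). In the HTML quoted in the question, each time tag also carries a title attribute with the absolute timestamp, so a variation on the same block-by-block approach could read that instead. The sketch below runs on a small, hypothetical HTML snippet rather than the live page; the date format string is an assumption inferred from the single sample "Apr 28, 2019 11:29 PM", and the None check is an added precaution for blocks that lack a rating or a date.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical, cut-down markup modelled on the review block in the question.
html = """
<div class="col-xs-12 col-md-9">
  <span class="c-review-average">2</span>
  <div class="disclaimer">Posted
    <time class="submission-date" title="Apr 28, 2019 11:29 PM">3 months ago</time>
  </div>
</div>
<div class="col-xs-12 col-md-9">
  <p>A block with no rating or date</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for block in soup.find_all("div", class_="col-xs-12 col-md-9"):
    rating = block.find("span", class_="c-review-average")
    date = block.find("time", class_="submission-date")
    # find() returns None when nothing matches, so guard before using the tags
    if rating is None or date is None:
        continue
    # Assumed format, based on the one timestamp shown in the question
    posted = datetime.strptime(date["title"], "%b %d, %Y %I:%M %p")
    rows.append((int(rating.text), posted))

print(rows)  # [(2, datetime.datetime(2019, 4, 28, 23, 29))]
```

Storing the parsed timestamp rather than the relative text keeps the dates sortable and stable across later runs.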