Question

I'm trying to build a script that scrapes the text and the star rating from each review on Yelp and stores the data in an Excel file.

The HTML snippet I am working with is below:

<div class="review-content">
<div class="biz-rating biz-rating-large clearfix">
<div>
<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating">
<img alt="5.0 star rating" class="offscreen" height="303" 
src="https://s3-media1.fl.yelpcdn.com/assets/srv0/yelp_design_web/41341496d9db/assets/img/stars/stars.png" width="84"/>
</div>
</div>
<span class="rating-qualifier">
    5/10/2017
</span>
</div>
<p lang="en">This place is really fun and cute. I was happy to discover 
it.. <br/><br/>They also have beer and wine here, which is kind of a 
nice bonus. The sangria is good..</p>
</div>

My Python code is below:

import requests
from bs4 import BeautifulSoup as soup
import xlsxwriter

#Index for xlsxwriter
row = 1
i = 0

#Index for all of the review-containing pages for one restaurant.

page_num = 0

#Call xlsxwriter and name the output file.
workbook = xlsxwriter.Workbook('file_1.xlsx')
worksheet = workbook.add_worksheet()

#Write in the header for the file
worksheet.write('A1','num_stars')
worksheet.write('B1', 'review_text')

#Loop to scrape all of the reviews off of one single page with a specific
#url and advance to all subsequent pages of the restaurant.

while page_num <= 260:

  url = "https://www.yelp.com/biz/monkey-house-cafe-huntington-beach?start=%s" % page_num

  r = requests.get(url)
  page_soup = soup(r.content, "lxml")

  review_container = page_soup.findAll("div", {"class": "review-content"})

  for review in review_container:
      string = str(review.p.text)
      stars = float(review[i].select('img')[0]['alt'].split()[0])
      worksheet.write(row, 0, stars)
      worksheet.write(row, 1, string)
      row += 1
      i += 1

  #Advance counter in order to scrape the next url for the restaurant    
  page_num += 20

workbook.close()

The problem when I run this script is that I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-18-a73ecb4ef119> in <module>()
 38     for review in review_container:
 39         string = str(review.p.text)
---> 40         stars = float(review[i].select('img')[0]['alt'].split()[0])
 41         worksheet.write(row, 0, stars)
 42         worksheet.write(row, 1, string)

//anaconda/lib/python3.5/site-packages/bs4/element.py in __getitem__(self, key)
956         """tag[key] returns the value of the 'key' attribute for the tag,
957         and throws an exception if it's not there."""
--> 958         return self.attrs[key]
959 
960     def __iter__(self):

KeyError: 0

I understand that the line causing the error is the one below:

stars = float(review[i].select('img')[0]['alt'].split()[0])

However, I don't quite understand how to correct the error so that the script works.

What changes do I need to make to the code for the script to work?

Answer 1

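I believe it should just be

stars = float(review.select('img')[0]['alt'].split()[0])

Inside the loop, review is already a single element from review_container. Indexing it with review[i] asks BeautifulSoup for an HTML attribute named 0 (that is the self.attrs[key] lookup in the traceback), which raises KeyError: 0.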
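
To illustrate, here is a minimal sketch that runs against a trimmed copy of the markup from the question instead of the live Yelp page. The names html, rows and raised are mine, and I use the built-in "html.parser" and select_one only to keep the example small; this is an assumption-laden demo, not a drop-in replacement for the full script.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the review markup from the question (illustrative only).
html = """
<div class="review-content">
  <div class="biz-rating biz-rating-large clearfix">
    <div>
      <div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating">
        <img alt="5.0 star rating" class="offscreen" height="303" width="84"/>
      </div>
    </div>
    <span class="rating-qualifier">5/10/2017</span>
  </div>
  <p lang="en">This place is really fun and cute.</p>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

rows = []
for review in page_soup.find_all("div", {"class": "review-content"}):
    # `review` is already one Tag, so search inside it directly.
    img = review.select_one("img")
    if img is None or review.p is None:
        continue  # skip containers without a rating image or a text paragraph
    stars = float(img["alt"].split()[0])  # "5.0 star rating" -> 5.0
    rows.append((stars, review.p.get_text(strip=True)))

# Indexing a Tag looks up an HTML attribute, so an integer key fails.
first = page_soup.find("div", {"class": "review-content"})
raised = False
try:
    first[0]
except KeyError:
    raised = True

print(rows)
print("review[0] raised KeyError:", raised)
```

In the original script, each (stars, text) pair would then go to worksheet.write(row, 0, stars) and worksheet.write(row, 1, text), and the i counter is no longer needed.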
