Question

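I'm collecting reviews from the web. Some products have multiple pages of reviews; others have only one. With help from some people here, I wrote code that makes the scraper click the "Next page" link when one exists.

My problem is that when there is only one page of reviews, there is no link to click, and the scraper just keeps waiting. I would like the program to check whether the next-page link exists: if it does, click it; if it does not, go back to the top of the loop.

Here is my code: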

for url in list_urls:
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html)

        # See if the "next page" link exists: if it does not, go back to the top of the loop
        href_test = soup.find('div', id='company_reviews_pagination')
        if href_test is None:
            break

        # If next-page link exists, click on it
        last_link = href_test.find_all('a')[-1]
        if last_link.text.startswith('Next'):
            next_url_parts = urllib.parse.urlparse(last_link['href'])
            url = urllib.parse.urlunparse(...)  # code to define the "next-page" url - that part works!
        else:
            break

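So far it doesn't give me any errors, but the program doesn't make progress; it just keeps waiting. What am I doing wrong? Should I use a "try" statement to handle this case specifically?

Thanks a lot in advance. Any guidance is much appreciated.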

Answer 1

So here is how I fixed it. Instead of playing with an "if the link exists" condition, I used try/except:

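    try:
        last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
        if last_link.text.startswith('Next'):
            next_url_parts = urllib.parse.urlparse(last_link['href'])
            url = urllib.parse.urlunparse(...)  # code to find the next-page link
        else:
            break
    except (AttributeError, IndexError, KeyError):
        # No pagination div (find() returned None), no links inside it,
        # or a last link without an href: treat all of these as "no next page"
        break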
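
For comparison, the same idea can be written with an explicit "does the link exist" check instead of try/except. This is only a sketch under a few assumptions: the helper names (find_next_url, iter_pages, fetch_with_timeout) and the max_pages cap are made up for illustration, and urllib.parse.urljoin stands in for the urlunparse step that was omitted above, so it may not build exactly the same URL. The timeout on urlopen is there because a request without one can block indefinitely, which might be one reason a scraper appears to "keep waiting".

```python
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup


def find_next_url(html, current_url):
    """Return the absolute URL of the "Next" page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("div", id="company_reviews_pagination")
    if pagination is None:
        return None  # no pagination block: the product has a single page
    links = pagination.find_all("a", href=True)
    if not links or not links[-1].get_text(strip=True).startswith("Next"):
        return None  # pagination exists, but this is the last page
    # Assumption: resolving the href against the current URL is sufficient.
    return urllib.parse.urljoin(current_url, links[-1]["href"])


def iter_pages(start_url, fetch, max_pages=100):
    """Yield (url, html) for start_url and each following "Next" page."""
    url, seen = start_url, set()
    # Stop when there is no next page, a URL repeats, or the cap is reached.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_url(html, url)


def fetch_with_timeout(url, timeout=10):
    """Download a page, giving up after `timeout` seconds instead of hanging."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

With these helpers the outer loop becomes `for start in list_urls:` followed by `for page_url, html in iter_pages(start, fetch_with_timeout):`, and the review parsing goes in the inner loop body. Returning None for "no next page" keeps the three cases (no pagination div, no links, last link is not "Next") in one place, and the set of visited URLs guards against a page whose "Next" link points back to itself.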

Looping through web pages when the href does not exist on some pages
