BeautifulSoup module not detecting any tags

Time: 2019-02-15 19:09:02

Tags: python python-3.x web-scraping beautifulsoup html-parsing

I'm new to Python and am working on a project that opens all of the review links on an Amazon product page. Why does the soup.select() method fail to find the correct HTML tags for the links?

import webbrowser, requests, sys, bs4, logging

logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - % (levelname)s - %(message)s')

print("Searching...")  # Text to display while searching amazon

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = input("Enter the url: ")
res = requests.get(url, headers=headers)
res.raise_for_status()

# Retrieve reviews found
soup = bs4.BeautifulSoup(res.text, features='html.parser')

# Open a tab for each review found
linkElems = soup.select('div.a-row a.a-size-base.a-link-normal.review- title.a-color-base.a-text-bold')
numOpen = min(5, len(linkElems))
logging.debug(linkElems)
for i in range(numOpen):
    logging.debug("Link is: " + str(linkElems[i].get('href')))
    webbrowser.open('https://amazon.com' + linkElems[i].get('href'))

I expected this code to build and open a list of product review links, but when I run it, the list of tags found comes back empty.

2 answers:

Answer 0: (score: 0)

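Update 1: The OP has edited the post and fixed the whitespace in the code formatting, so I am updating my answer.

When the program asks for the link, type (or paste) your link and add a space before pressing Enter.

The trailing space stops the IDE from treating Enter as a request to open the link in a browser window. Enter then ends the input as usual, and execution continues with the code after the input() call.

With that, as demonstrated in my first answer, your code actually works.

My previous answer: I noticed that you have a stray space at "a-color-base".

Replace this line: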

linkElems = soup.select('div.a-row a.a-size-base.a-link-normal.review-title.a- color-base.a-text-bold')

with

linkElems = soup.select('div.a-row a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold')
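To see why a single space matters, here is a minimal sketch. SAMPLE_HTML is a made-up, simplified stand-in for one review link (the href is a placeholder), not the real Amazon markup. In a CSS selector, whitespace is the descendant combinator, so "a- color-base" is read as a class named "a-" followed by a descendant element named "color-base".

```python
import bs4

# Hypothetical, simplified stand-in for one review link on the product page.
SAMPLE_HTML = """
<div class="a-row">
  <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" href="/gp/customer-reviews/EXAMPLE">Nice product.</a>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, features="html.parser")

# Stray space: the part after it is treated as a separate descendant element
# named "color-base", which does not exist in the document.
broken = soup.select(
    "div.a-row a.a-size-base.a-link-normal.review-title.a- color-base.a-text-bold"
)

# Same selector with the space removed: every class applies to the <a> tag.
fixed = soup.select(
    "div.a-row a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold"
)

print(len(broken), len(fixed))
```

On this sample the selector with the space matches nothing, while the corrected one matches the link.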

Extra: Your code currently works only with amazon.com. To make it work with other Amazon sites such as amazon.in or amazon.co.uk, you need to change this line:

webbrowser.open('https://amazon.com' + linkElems[i].get('href'))

to something like:

from urllib.parse import urlparse
url_components = urlparse(url)
webbrowser.open('https://' + url_components.netloc + linkElems[i].get('href'))

Now it works with other Amazon sites (for example amazon.in), not just amazon.com. Give it a try.

Test run:

Enter the url: https://www.amazon.in/Intex-PB-16K-Poly-16000mAH-Lithium/dp/B07843GH8X/ref=cm_cr_srp_d_product_top?ie=UTF8 
 2019-02-15 22:51:28,940 - DEBUG - Starting new HTTPS connection (1): www.amazon.in:443
 2019-02-15 22:51:30,125 - DEBUG - https://www.amazon.in:443 "GET /Intex-PB-16K-Poly-16000mAH-Lithium/dp/B07843GH8X/ref=cm_cr_srp_d_product_top?ie=UTF8%20 HTTP/1.1" 200 None
 2019-02-15 22:51:32,019 - DEBUG - [<a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R123ICSCDM2MF3/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">Nice product.</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R3OTWE19SMPPVQ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">Not a good product. It takes a day to charge the ...</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R1Z51ERFCD7D6P/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">not portable easily..</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R24GYC4HRBGTM1/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">Five Stars</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R2P5ILE8KQF8PJ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">Fine not superb</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R3C9ZYFRT9NWAK/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">Quality &amp; Performance</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R16F4OE3LWHHQI/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">Worst experience. Don’t buy</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R26ROATK8PU6TL/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B07843GH8X">Very bad product</a>]
 2019-02-15 22:51:32,019 - DEBUG - Link is: /gp/customer-reviews/R123ICSCDM2MF3/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
 2019-02-15 22:51:32,295 - DEBUG - Link is: /gp/customer-reviews/R3OTWE19SMPPVQ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
 2019-02-15 22:51:32,628 - DEBUG - Link is: /gp/customer-reviews/R1Z51ERFCD7D6P/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
 2019-02-15 22:51:32,933 - DEBUG - Link is: /gp/customer-reviews/R24GYC4HRBGTM1/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
 2019-02-15 22:51:33,302 - DEBUG - Link is: /gp/customer-reviews/R2P5ILE8KQF8PJ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
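As an alternative to joining the scheme and host by hand, the standard library's urljoin can resolve a site-relative href against the page URL. This is only a sketch: absolute_review_url is a hypothetical helper name, and the URLs are taken from the test run above.

```python
from urllib.parse import urljoin


def absolute_review_url(page_url, href):
    """Resolve a site-relative href against the URL the user entered."""
    return urljoin(page_url, href)


page_url = "https://www.amazon.in/Intex-PB-16K-Poly-16000mAH-Lithium/dp/B07843GH8X/ref=cm_cr_srp_d_product_top?ie=UTF8"
href = "/gp/customer-reviews/R123ICSCDM2MF3/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X"

# The scheme and host come from page_url; the path and query come from href.
print(absolute_review_url(page_url, href))
```

Because the href starts with "/", the result keeps only the scheme and host of the page URL, so the same code works for amazon.com, amazon.in, and so on.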

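If your debug logging throws an error: I'm not sure whether the space is in your actual code or is just a formatting artifact here, but either way you should remove the space from the '%(levelname)s' part of the format argument so that logging works as expected:

Replace this: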

logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - %  (levelname)s - %(message)s')

with

logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - %(levelname)s - %(message)s')
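The effect of that space can be reproduced without the logging module, since its default style applies ordinary %-formatting to the record's fields. The sketch below uses a made-up record_fields dictionary and a hypothetical try_format helper to compare the two format strings.

```python
# Fields that a log record would supply; the values here are made up.
record_fields = {
    "asctime": "2019-02-15 22:51:28,940",
    "levelname": "DEBUG",
    "message": "Searching...",
}


def try_format(fmt):
    """Apply %-style formatting and report whether it succeeded."""
    try:
        return fmt % record_fields
    except (TypeError, ValueError) as exc:
        return "formatting failed: " + type(exc).__name__


# With the space, "% (" is no longer a named placeholder and formatting fails.
print(try_format(" %(asctime)s - % (levelname)s - %(message)s"))

# Without the space, all three placeholders are filled in.
print(try_format(" %(asctime)s - %(levelname)s - %(message)s"))
```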

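Hope this helps.

Answer 1: (score: 0)

The whitespace problem has already been mentioned. However, you are using a long selector, which is slower and probably more fragile as well. You could use the more performant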

linkElems = soup.select("a.review-title")

or the even faster

linkElems = soup.select(".review-title")

Class selectors are the second-fastest selector method, after id selectors.
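As a quick check that the shorter selectors pick out the same links, here is a small sketch. SAMPLE_HTML is a made-up fragment with placeholder hrefs, not the real Amazon page, and the example compares results only; it does not measure speed.

```python
import bs4

# Hypothetical fragment with two review links; the hrefs are placeholders.
SAMPLE_HTML = """
<div class="a-row">
  <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" href="/gp/customer-reviews/EXAMPLE-1">Nice product.</a>
  <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" href="/gp/customer-reviews/EXAMPLE-2">Five Stars</a>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, features="html.parser")


def hrefs(selector):
    """Return the href of every element matched by the CSS selector."""
    return [elem.get("href") for elem in soup.select(selector)]


long_form = hrefs(
    "div.a-row a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold"
)
tag_and_class = hrefs("a.review-title")
class_only = hrefs(".review-title")

print(long_form == tag_and_class == class_only)
```

The shorter forms depend on a single class name, so they keep working if the other styling classes change. On a real page it is still worth confirming that nothing else carries the review-title class.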