I'm new to Python and am working on a project that opens all of the review links on an Amazon product page.
import webbrowser, requests, sys, bs4, logging
logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - % (levelname)s - %(message)s')
print("Searching...")  # Text to display while searching amazon
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 '
           'Safari/537.36'}
url = input("Enter the url: ")
res = requests.get(url, headers=headers)
res.raise_for_status()
# Retrieve reviews found
soup = bs4.BeautifulSoup(res.text, features='html.parser')
# Open a tab for each review found
linkElems = soup.select('div.a-row a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold')
numOpen = min(5, len(linkElems))
logging.debug(linkElems)
for i in range(numOpen):
    logging.debug("Link is: " + str(linkElems[i].get('href')))
    webbrowser.open('https://amazon.com' + linkElems[i].get('href'))
Why doesn't this method find the correct HTML tags for the links in Python?
I expected this code to build and open a list of product review links, but when I run it, the list of matched tags comes back empty.
Answer 0 (score: 0)
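Update 1: Since the OP edited the post and fixed the whitespace in the code formatting, I am updating my answer.
When the program asks for the link, type (or paste) your link and add a space before pressing Enter.
The extra space at the end of the input stops the IDE from opening the link in a browser window when you press Enter instead of ending the input, so the program carries on as expected and runs the code that follows the input() call.
Done this way, your code actually works, as I demonstrated in my first answer.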
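As a side note on that trailing space: the test run in this answer shows it reaching the server as `%20` at the end of the request path. A minimal sketch, assuming you would rather not send it, is to strip whitespace from the input before making the request. `clean_url` is a hypothetical helper and the URL is a placeholder; neither is part of the original code.

```python
def clean_url(raw: str) -> str:
    """Return the entered URL without leading or trailing whitespace."""
    return raw.strip()


if __name__ == "__main__":
    # The trailing space mimics the IDE workaround described above.
    entered = "https://www.example.com/product "
    print(repr(clean_url(entered)))
```

In the original script this would amount to wrapping the `input(...)` call, for example `url = clean_url(input("Enter the url: "))`.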
My previous answer: I noticed an incorrect space at "a-color-base".
Replace this line:
linkElems = soup.select('div.a-row a.a-size-base.a-link-normal.review-title.a- color-base.a-text-bold')
with:
linkElems = soup.select('div.a-row a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold')
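Why a single space matters: in a CSS selector, whitespace is the descendant combinator, so `a- color-base` no longer belongs to the class chain on the `a` element. It asks instead for a `color-base` element nested inside that link. A rough illustration, assuming a selector whose only combinator is whitespace (no `>`, `+`, `~`, or attribute values containing spaces), is to split on whitespace and look at the parts. `compound_parts` is a hypothetical helper for illustration only, not how BeautifulSoup parses selectors.

```python
from typing import List


def compound_parts(selector: str) -> List[str]:
    """Split a selector on whitespace, i.e. on the descendant combinator."""
    return selector.split()


broken = "div.a-row a.a-size-base.a-link-normal.review-title.a- color-base.a-text-bold"
fixed = "div.a-row a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold"

if __name__ == "__main__":
    # With the stray space, the last part targets a <color-base> element.
    print(compound_parts(broken)[-1])
    # Without it, the last part targets the <a> element itself.
    print(compound_parts(fixed)[-1])
```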
Extra: Also, your code currently works only with amazon.com. To make it work with other Amazon sites such as amazon.in or amazon.co.uk, you need to change the following line:
webbrowser.open('https://amazon.com' + linkElems[i].get('href'))
to something like:
from urllib.parse import urlparse
url_components = urlparse(url)
webbrowser.open('https://' + url_components.netloc + linkElems[i].get('href'))
Now it works with other Amazon sites such as amazon.in, not just amazon.com; feel free to try it.
Test run:
Enter the url: https://www.amazon.in/Intex-PB-16K-Poly-16000mAH-Lithium/dp/B07843GH8X/ref=cm_cr_srp_d_product_top?ie=UTF8
2019-02-15 22:51:28,940 - DEBUG - Starting new HTTPS connection (1): www.amazon.in:443
2019-02-15 22:51:30,125 - DEBUG - https://www.amazon.in:443 "GET /Intex-PB-16K-Poly-16000mAH-Lithium/dp/B07843GH8X/ref=cm_cr_srp_d_product_top?ie=UTF8%20 HTTP/1.1" 200 None
2019-02-15 22:51:32,019 - DEBUG - [<a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R123ICSCDM2MF3/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">Nice product.</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R3OTWE19SMPPVQ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">Not a good product. It takes a day to charge the ...</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R1Z51ERFCD7D6P/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">not portable easily..</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R24GYC4HRBGTM1/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">Five Stars</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R2P5ILE8KQF8PJ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">Fine not superb</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R3C9ZYFRT9NWAK/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">Quality & Performance</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R16F4OE3LWHHQI/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">Worst experience. Don’t buy</a>, <a class="a-size-base a-link-normal review-title a-color-base a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R26ROATK8PU6TL/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X">Very bad product</a>]
2019-02-15 22:51:32,019 - DEBUG - Link is: /gp/customer-reviews/R123ICSCDM2MF3/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
2019-02-15 22:51:32,295 - DEBUG - Link is: /gp/customer-reviews/R3OTWE19SMPPVQ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
2019-02-15 22:51:32,628 - DEBUG - Link is: /gp/customer-reviews/R1Z51ERFCD7D6P/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
2019-02-15 22:51:32,933 - DEBUG - Link is: /gp/customer-reviews/R24GYC4HRBGTM1/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
2019-02-15 22:51:33,302 - DEBUG - Link is: /gp/customer-reviews/R2P5ILE8KQF8PJ/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X
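The host-independent link building can also be sketched with `urllib.parse.urljoin`, which resolves a root-relative `href` against the URL the page was fetched from. This is a swapped-in alternative to the string concatenation shown earlier, not the code used for the test run; `absolute_review_url` is a hypothetical helper name, and the page URL is shortened from the one entered above.

```python
from urllib.parse import urljoin


def absolute_review_url(page_url: str, href: str) -> str:
    """Resolve a (possibly relative) review href against the page URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "https://www.amazon.in/Intex-PB-16K-Poly-16000mAH-Lithium/dp/B07843GH8X"
    href = "/gp/customer-reviews/R123ICSCDM2MF3/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07843GH8X"
    print(absolute_review_url(page, href))
```

Because the scheme and host come from the page URL, the same call should work unchanged for other Amazon domains.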
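If your logging call throws an error: I'm not sure whether the space is in your actual code or is just a formatting artifact here, but either way you should remove the space from the '%(levelname)s' argument in the format string so that logging works as expected:
Replace this: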
logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - % (levelname)s - %(message)s')
with:
logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - %(levelname)s - %(message)s')
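To check a format string without running the whole script, a small sketch like the following can help. In %-style formatting the `(` of a mapping key has to follow the `%` immediately, so `% (levelname)s` is rejected when a record is formatted. `try_format` is a hypothetical helper, and the record values ("demo", "demo.py", "hello") are made up for illustration.

```python
import logging


def try_format(fmt: str) -> str:
    """Format one DEBUG record with fmt; return the text, or 'ValueError' on failure."""
    record = logging.LogRecord("demo", logging.DEBUG, "demo.py", 1, "hello", None, None)
    try:
        return logging.Formatter(fmt).format(record)
    except ValueError:
        return "ValueError"


if __name__ == "__main__":
    print(try_format("%(levelname)s - %(message)s"))   # well-formed format string
    print(try_format("% (levelname)s - %(message)s"))  # stray space after '%'
```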
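Hope this helps.
Answer 1 (score: 0)
The whitespace problem has already been mentioned. However, you are using a long selector, which is slower and probably more fragile too. You can use the better-performing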
linkElems = soup.select("a.review-title")
or the even faster
linkElems = soup.select(".review-title")
Class selectors are the second-fastest selector method, after id.
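The short selector is enough because a class selector matches any element whose class list contains that class, whatever other classes are present. The sketch below shows the idea with the standard library's `html.parser` as a stand-in for BeautifulSoup, so it runs without third-party packages. The class name `ReviewTitleLinks` and the `SAMPLE` markup are made up for illustration, with the markup modeled on the debug output in the first answer.

```python
from html.parser import HTMLParser


class ReviewTitleLinks(HTMLParser):
    """Collect href values of tags whose class list contains 'review-title'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # The class attribute is a whitespace-separated list of class names.
        classes = (attr.get("class") or "").split()
        if "review-title" in classes and attr.get("href"):
            self.hrefs.append(attr["href"])


SAMPLE = (
    '<div class="a-row">'
    '<a class="a-size-base a-link-normal review-title a-color-base a-text-bold" '
    'href="/gp/customer-reviews/R123ICSCDM2MF3/ref=cm_cr_dp_d_rvw_ttl">Nice product.</a>'
    '<a class="a-link-normal" href="/other">Other link</a>'
    '</div>'
)

if __name__ == "__main__":
    parser = ReviewTitleLinks()
    parser.feed(SAMPLE)
    print(parser.hrefs)
```

Matching on the one class that identifies the element keeps the lookup working even if the surrounding styling classes change.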