Scraping a web page with BeautifulSoup

Time: 2017-12-20 17:02:16

Tags: python-3.x beautifulsoup

I want to follow the link to the next page and extract its information.

I am getting an error.

import urllib.request
from bs4 import BeautifulSoup


# NOTE: this URL contains no "{}" placeholder, so .format(i) returns the same
# string for every i. The page number has to go wherever the site expects it.
urls = ['https://fr.aliexpress.com/category/205000316/men-clothing-accessories.html'.format(i) for i in range(1, 10)]

# Send a single User-Agent string rather than several joined with commas.
headers = {'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"}

for url in urls:
    # Request takes one URL string, not a list, so fetch each page in turn.
    req = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(req).read()

    soup = BeautifulSoup(html.decode('utf8', 'ignore'), "html.parser")

    # Retrieve product info such as name and rating.
    products = soup.select('ul.son-list li.list-item')
    for product in products:
        name_element = product.select_one("a.product")
        name = name_element.get_text() if name_element else "Unknown name"
        stars_element = product.select_one(".star")
        rating = stars_element["title"].split(": ")[1].strip().split(" ", 1)[0] if stars_element else "Unknown rating"

        print(name, rating)
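
One thing that may be worth checking is how the page URLs are built. The short sketch below shows that calling .format(i) on a string without a "{}" placeholder leaves it unchanged, so the list holds nine copies of the same URL. The "?page={}" suffix is purely hypothetical and only illustrates the mechanism; the real pagination scheme has to be read from the site's own "next" link.

```python
# The category URL from the question; it has no "{}" placeholder.
base = "https://fr.aliexpress.com/category/205000316/men-clothing-accessories.html"

# str.format() ignores extra arguments when there is nothing to fill in,
# so every element of this list is the same string.
same_urls = [base.format(i) for i in range(1, 10)]
print(len(set(same_urls)))  # 1

# HYPOTHETICAL template: the "page" query parameter is an assumption made
# for illustration, not something confirmed for this site.
template = base + "?page={}"
page_urls = [template.format(i) for i in range(1, 10)]
print(len(set(page_urls)))  # 9
```

With a working template, each element of page_urls is a separate string that can be passed to urllib.request.Request one at a time.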

0 answers:

No answers