urllib.request isn't working on Python 3. How do I use BeautifulSoup?

Asked: 2019-01-22 20:51:59

Tags: python web-scraping beautifulsoup

I'm trying to learn how to scrape websites, and I keep running into urllib.request, which doesn't work for me.

import urllib.request
import bs4 as bs
sauce = urllib.request.urlopen('https://www.goat.com/collections/just-dropped').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
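
For context, "doesn't work" here is most likely an HTTP error rather than a problem with urllib itself: this site tends to reject requests carrying urllib's default User-Agent. A minimal sketch (assuming the failure is an HTTP error) to surface the actual status code:

import urllib.request
import urllib.error

try:
    sauce = urllib.request.urlopen('https://www.goat.com/collections/just-dropped').read()
except urllib.error.HTTPError as e:
    # Prints the status the server returned, e.g. 403 Forbidden
    print(e.code, e.reason)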

3 Answers:

Answer 0 (score: 1)

Try requests:

import requests
import bs4 as bs
sauce = requests.get('https://www.goat.com/collections/just-dropped').text
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
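
A quick way to confirm the request itself succeeded before inspecting the soup (raise_for_status is part of the standard requests API):

import requests
import bs4 as bs

resp = requests.get('https://www.goat.com/collections/just-dropped')
resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
soup = bs.BeautifulSoup(resp.text, 'lxml')
print(soup.title)  # sanity check that HTML came back and was parsed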

Answer 1 (score: 0)

You have to set the User-Agent header, but unfortunately the page content is loaded dynamically with JavaScript, so to get the actual product data you have to use Selenium.

from urllib.request import Request, urlopen
import bs4 as bs

req = Request('https://www.goat.com/collections/just-dropped')
# Spoof a regular browser User-Agent so the server doesn't reject the request
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')
sauce = urlopen(req).read()

soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
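
Because the products are rendered client-side, the HTML fetched this way most likely won't contain them. A quick check, continuing from the snippet above (the '.goat-clean-product-template' selector is borrowed from the Selenium example below and is an assumption about the page's markup):

# Likely prints 0: product nodes are added by JavaScript after page load,
# so they are absent from the server-rendered HTML.
print(len(soup.select('.goat-clean-product-template')))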

To use Selenium, you need to install Selenium, Chrome, and chromedriver:

pip install selenium
pip install chromedriver-binary

Code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import chromedriver_binary  # Adds chromedriver binary to path

driver = webdriver.Chrome()
driver.get('https://www.goat.com/collections/just-dropped')

# wait until the products are rendered; find_elements (plural) returns a
# list, which is what the loop below expects
products = WebDriverWait(driver, 15).until(
    lambda d: d.find_elements_by_css_selector('.goat-clean-product-template')
)

for p in products:
    name = p.get_attribute('title')
    url = p.get_attribute('href')
    print('%s: %s' % (name, url))
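
Note that the find_elements_by_* helpers were removed in Selenium 4. On current versions, the same wait is written with the By API; a sketch under the same selector assumption as above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # Selenium 4.6+ can fetch a matching driver itself
driver.get('https://www.goat.com/collections/just-dropped')

# Wait until at least one product element is present;
# find_elements returns an empty (falsy) list until then.
products = WebDriverWait(driver, 15).until(
    lambda d: d.find_elements(By.CSS_SELECTOR, '.goat-clean-product-template')
)

for p in products:
    print('%s: %s' % (p.get_attribute('title'), p.get_attribute('href')))

driver.quit()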

Answer 2 (score: 0)

As mentioned before, you can simply use the requests library to get the page content.

First of all, you have to install bs4 and requests through pip. That will fix the ModuleNotFoundError you're getting:

pip install bs4
pip install requests

Then here is the code to fetch the data:

import requests
from bs4 import BeautifulSoup
sauce = requests.get('https://www.goat.com/collections/just-dropped')
soup = BeautifulSoup(sauce.text, 'lxml')
print(soup)
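
From there you can query the parsed tree instead of printing all of it; for example (generic BeautifulSoup calls, not selectors specific to this page):

# Print the page title and the first few links as a starting point
print(soup.title.string if soup.title else 'no <title> found')
for link in soup.find_all('a', href=True)[:10]:
    print(link['href'])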