How to scrape data from Shopee using Beautiful Soup

Date: 2020-05-28 05:29:37

Tags: python web-scraping beautifulsoup

I am a student currently learning BeautifulSoup, and my lecturer asked me to extract data from a shop, but I am unable to extract the product details. I am trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want to scrape the name and price of the products. Can someone tell me why I can't scrape the data using BeautifulSoup?

Here is my code:

from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

3 Answers:

Answer 0 (score: 2)

This question is a bit tricky (for a Python beginner), because it involves a combination of Selenium (for headless browsing) and BeautifulSoup (for HTML data extraction). It is made harder still because the Document Object Model (DOM) is built by JavaScript. We know JavaScript is involved because we get an empty response from the website when accessing it with BeautifulSoup alone, e.g. for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'): print(item_n.get_text())

So, to extract data from a webpage whose DOM is controlled by a scripting language, we have to use Selenium for headless browsing (this tells the website that a browser is accessing it). We also have to use some sort of delay parameter (which tells the website that a human is accessing it). For that, the function WebDriverWait() from the Selenium library helps.
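A caveat before the snippets: WebDriverWait(browser, delay) on its own only constructs a wait object and does not pause anything; to actually block until the page is ready, it has to be chained with until() and an expected condition. A minimal sketch of that pattern (the class names are the ones used further below and may well have changed since):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# block for up to `delay` seconds until the product-name divs appear;
# raises TimeoutException if they never show up
WebDriverWait(browser, delay).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div._1NoI8_._16BAGk'))
)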

I now present code snippets that explain the process.

First, import the required libraries:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep

Next, initialize the settings for the headless browser. I am using Chrome.

# create an object for the Chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'

# set Chrome driver options to disable any popups from the website;
# to find the local path for your Chrome profile, open the Chrome browser
# and type "chrome://version" in the address bar
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# disable the message "Chrome is being controlled by automated test software"
chrome_options.add_argument('--disable-infobars')
# pass 1 to allow notifications and 2 to block them
chrome_options.add_experimental_option("prefs", { 
    "profile.default_content_setting_values.notifications": 2
    })
# invoke the webdriver
browser = webdriver.Chrome(executable_path = r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                          options = chrome_options)
browser.get(base_url)
delay = 5  # seconds

接下来,我声明一个空列表变量来保存数据。

# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
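        # note: on its own, WebDriverWait only constructs the wait object
        # (see the caveat above); the real pause comes from sleep() below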
        WebDriverWait(browser, delay)
        print ("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")

        # find_all() returns an array of elements.
        # We have to go through all of them, select the ones we need, and then call get_text()
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)

        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)

        # find initial item cost
        for item_ic in soup.find_all('div', class_ = '_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)
        # find total number of items sold/month
        for items_s in soup.find_all('div',class_ = '_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)

        # find item discount percent
        for dp in soup.find_all('span', class_ = 'percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)
        # find item location
        for il in soup.find_all('div', class_ = '_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)

        break  # exit the loop once the page has been scraped
    except TimeoutException:
        print ("Loading took too much time!-Try again")

After this, I use the zip function to combine the items of the different lists into rows.

rows = zip(item_name, item_init_cost, discount_percent, item_cost, items_sold, item_loc)
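One caveat worth knowing: zip() stops at the shortest input, so if one selector matched fewer elements than the others, entire rows are silently dropped. If you would rather keep every row and pad the gaps, itertools.zip_longest is a drop-in alternative:

from itertools import zip_longest

# pads missing values with '' instead of truncating to the shortest list
rows = zip_longest(item_name, item_init_cost, discount_percent,
                   item_cost, items_sold, item_loc, fillvalue='')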

Finally, I write this data to disk:

import csv
newFilePath = 'shopee_item_list.csv'
with open(newFilePath, "w", newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

As good practice, it is wise to close the headless browser once the task is complete, so I coded it as:

# close the automated browser
browser.close()
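Note that browser.close() only closes the current window; if you also want to terminate the underlying chromedriver process, Selenium provides quit():

# ends the whole session and shuts down the chromedriver process
browser.quit()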

Results

Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor

A note to the readers

The OP brought to my attention that the XPath was not working as given in my answer. I checked the website again after 2 days and noticed a strange phenomenon: the class_ attribute values of the div elements had indeed changed. I found a similar Q, but it did not help much. So for now, I conclude that the div attributes on the Shopee website can change again. I leave this as an open problem to be solved later.

A note to the OP

Ana, the above code works for just one page, i.e. only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by working out how to scrape data from multiple webpages under the sales tag; a sketch follows below. Your hint is the 1/9 shown at the top right of the page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint for you is to look at urljoin in the urllib.parse library. Hope this gets you started.
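As a starting point, here is a minimal sketch of how the page URLs could be generated (the page count of 9 is only an assumption, taken from the 1/9 indicator mentioned above):

from urllib.parse import urljoin

base = 'https://shopee.com.my/shop/13377506/'
# assuming 9 result pages, as suggested by the "1/9" indicator
page_urls = [urljoin(base, 'search?page={}&sortBy=sales'.format(n)) for n in range(9)]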


Answer 1 (score: 2)

The page content is loaded asynchronously via AJAX after the first request, so it does not seem possible to send a single request and get the source of the page you need.
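You can verify this yourself: the HTML returned by a plain request contains none of the product markup. A quick check (the container class is the same one waited on in the code below):

import requests
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# the product grid is rendered by JavaScript, so it is absent here
print(soup.select('.shop-search-result-view'))  # -> []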

You should simulate a browser; then you can get the page source and work with BeautifulSoup. See the code:

The BeautifulSoup way

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
search = soup.select_one('.shop-search-result-view')
products = search.find_all('a')

for p in products:
    name = p.select('div[data-sqe="name"] > div')[0].get_text()
    price = p.select('div > div:nth-child(2) > div:nth-child(2)')[0].get_text()
    product = p.select('div > div:nth-child(2) > div:nth-child(4)')[0].get_text()
    print('name: ' + name)
    print('price: ' + price)
    print('product: ' + product + '\n')

However, using Selenium alone is a fine way to get everything you want. See the example below:

The Selenium way

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

search = driver.find_element_by_css_selector('.shop-search-result-view')
products = search.find_elements_by_css_selector('a')

for p in products:
    name = p.find_element_by_css_selector('div[data-sqe="name"] > div').text
    price = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(2)').text
    product = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(4)').text
    print('name: ' + name)
    print('price: ' + price.replace('\n', ' | '))
    print('product: ' + product + '\n')

Answer 2 (score: 0)

Please post your code so that we can help.

Or you can start like this.. :)

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReg


my_url = "<url>"
uClient = uReg(my_url)       # open the connection and grab the page
page_html = uClient.read()
uClient.close()              # close the connection once the HTML is read
page_soup = soup(page_html, "html.parser")  # parse the HTML
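Bear in mind, though, that as the other answers explain, Shopee builds its product listing with JavaScript, so a plain urlopen request will run into the same empty-response problem; this snippet is only a generic starting template.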