I am currently a student learning BeautifulSoup, and my lecturer has us extracting data from an online shop, but I cannot extract the product details. At the moment I am trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want to scrape the product names and prices. Can someone tell me why I cannot scrape the data with BeautifulSoup?
Here is my code:
from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Answer 0 (Score: 2)
This question is a little tricky (for a Python beginner) because it involves a combination of Selenium (for headless browsing) and BeautifulSoup (for HTML data extraction). The question is also made difficult because the Document Object Model (DOM) is built by JavaScript. We know JavaScript is involved because the site returns an empty result when accessed with BeautifulSoup alone, e.g.

for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_n.get_text())

prints nothing.
Therefore, to extract data from a webpage whose DOM is controlled by a scripting language, we have to use Selenium for headless browsing (this tells the website that a browser is visiting it). We also have to use some sort of delay parameter (which tells the website that it is being accessed by a human). For this, the function WebDriverWait() from the Selenium library helps.
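For instance, here is a minimal sketch of an explicit wait, reusing the browser object created below and the product-title class name from the code below (note that constructing WebDriverWait() on its own does not pause the script; it has to be chained with .until()):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 5 seconds until at least one product title is present,
# then continue; a TimeoutException is raised otherwise
WebDriverWait(browser, 5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div._1NoI8_._16BAGk'))
)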
I now provide code snippets that explain the process.

First, import the required libraries:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
Next, initialize the settings for the headless browser. I am using Chrome.
# create an object for Chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'
# set Chrome driver options to disable any popups from the website;
# to find the local path for the Chrome profile, open the Chrome browser
# and type "chrome://version" in the address bar
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# disable the message "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# pass the argument 1 to allow and 2 to block notifications
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})
# invoke the webdriver
browser = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options=chrome_options)
browser.get(base_url)
delay = 5  # seconds
Next, I declare empty list variables to hold the data.
# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        WebDriverWait(browser, delay)
        print("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")
        # find_all() returns an array of elements;
        # go through all of them, select the one you need, and then call get_text()
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)
        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)
        # find the initial item cost
        for item_ic in soup.find_all('div', class_='_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)
        # find the total number of items sold per month
        for items_s in soup.find_all('div', class_='_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)  # fixed: this previously appended item_ic.text
        # find the item discount percent
        for dp in soup.find_all('span', class_='percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)
        # find the item location
        for il in soup.find_all('div', class_='_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)
        break  # break out of the loop once the elements have been collected
    except TimeoutException:
        print("Loading took too much time! - Try again")
After that, I use the zip function to combine the different list items:

rows = zip(item_name, item_init_cost, discount_percent, item_cost, items_sold, item_loc)
Finally, I write this data to disk:
import csv

newFilePath = 'shopee_item_list.csv'
# newline='' prevents blank rows between records on Windows
with open(newFilePath, "w", newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
As good practice, it is wise to close the headless browser once the task is complete, so I code it as
# close the automated browser
browser.close()
Results
Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
A note to the readers
The OP brought to my attention that the XPath was not working as given in my answer. I checked the website again after two days and noticed a strange phenomenon: the class_ attributes of the div elements had indeed changed. I found a similar question, but it did not help much. So for now, I conclude that the div attributes on the Shopee website can change again. I leave this as an open problem to be solved later.
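One possible mitigation (an untested sketch, which assumes the data-sqe attributes used in Answer 1 below remain stable) is to anchor selectors on data attributes instead of the auto-generated class names:

# select the product-name blocks via the data-sqe attribute instead of the
# hashed class names, which the site appears to regenerate between builds
for name_div in soup.select('div[data-sqe="name"] > div'):
    print(name_div.get_text())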
A note to the OP
Ana, the above code works for just one page, i.e., it works only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by solving how to scrape data for multiple webpages under the sales tab. Your hint is the 1/9 shown at the top right of the page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint for you is to look at urljoin in the urllib.parse library. A sketch of this hint follows. Hope this gets you started.
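Here is a minimal sketch of that hint (assuming the 1/9 indicator means nine result pages, indexed from 0):

from urllib.parse import urljoin

base = 'https://shopee.com.my/shop/13377506/'
# build the URL of each results page; the page index starts at 0
for page in range(9):
    page_url = urljoin(base, 'search?page={}&sortBy=sales'.format(page))
    print(page_url)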
Answer 1 (Score: 2)
The page loads its content after the first request through asynchronous ajax calls, so it does not seem possible to send a single request and get the source of the page you need. You should simulate a browser; then you can get the page source and use BeautifulSoup. See the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # initialize the webdriver (assumed Chrome here)
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
search = soup.select_one('.shop-search-result-view')
products = search.find_all('a')

for p in products:
    name = p.select('div[data-sqe="name"] > div')[0].get_text()
    price = p.select('div > div:nth-child(2) > div:nth-child(2)')[0].get_text()
    product = p.select('div > div:nth-child(2) > div:nth-child(4)')[0].get_text()
    print('name: ' + name)
    print('price: ' + price)
    print('product: ' + product + '\n')
But using Selenium alone is also a good way to get everything you need. See the example below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # initialize the webdriver (assumed Chrome here)
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

search = driver.find_element_by_css_selector('.shop-search-result-view')
products = search.find_elements_by_css_selector('a')

for p in products:
    name = p.find_element_by_css_selector('div[data-sqe="name"] > div').text
    price = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(2)').text
    product = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(4)').text
    print('name: ' + name)
    print('price: ' + price.replace('\n', ' | '))
    print('product: ' + product + '\n')
Answer 2 (Score: 0)
Please post your code so that we can help.

Or you can start like this.. :)
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReg

my_url = "<url>"
uClient = uReg(my_url)
page_html = uClient.read()
uClient.close()
# parse the fetched HTML so the BeautifulSoup import is put to use
page_soup = soup(page_html, "html.parser")