I am trying to scrape some data from the Walmart website for research:
https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo
I want to scrape all of the product categories. Each product category has this container HTML:
<div class="TempoCategoryTileV2-tile"><img alt="" aria-hidden="true" tabindex="-1" itemprop="image" src="//i5.walmartimages.com/dfw/4ff9c6c9-deda/k2-_c3162a27-dbb6-46df-8b9f-b5b52ea657b2.v1.jpg?odnWidth=168&odnHeight=210&odnBg=ffffff" class="TempoCategoryTileV2-tile-img display-block">
<div class="TempoCategoryTileV2-tile-content-one text-center">
<div class="TempoCategoryTileV2-tile-linkText">
<div style="overflow: hidden;">
<div>Toyland</div>
</div>
</div>
</div><a class="TempoCategoryTileV2-tile-overlay" id="HomePage-contentZone12-FeaturedCategoriesCuratedV2-tileLink-1" aria-label="Toyland" href="/cp/toys/4171?povid=14503+%257C+contentZone12+%257C+2017-11-01+%257C+1+%257C+HP+FC+Toys" data-uid="zir3SFhh" tabindex="" data-tl-id="HomePage-contentZone12-FeaturedCategoriesCuratedV2-categoryTile-1-link" style="background-image: url("about:blank");"></a></div>
What I want to get is the text and the image of each category, so I used this Python script:
import time
import requests
from bs4 import BeautifulSoup as soup

Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'

uClient = ''
while uClient == '':
    try:
        start = time.time()
        uClient = requests.get(Walmarthome)
        print("Relax we are getting the data...")
    except:
        print("Connection refused by the server..")
        print("Let me sleep for 7 seconds")
        print("ZZzzzz...")
        time.sleep(7)
        print("Was a nice sleep, now let me continue...")
        continue

page_html = uClient.content
# close client
uClient.close()

page_soup = soup(page_html, "html.parser")
productcategories = page_soup.find_all("div", {"class": "TempoCategoryTileV2 Grid-col u-size-1-2 u-size-1-3-s u-size-1-4-m u-size-1-5-l u-size-1-6-xl"})
print(productcategories)

for categorycontainer in productcategories:
    categorycard = categorycontainer.find("div", {"class": "TempoCategoryTileV2-tile-linkText"})
    if categorycard is not None:
        print("getting link")
        print(categorycard)
But when I run it, all I get is:
"Relax we are getting the data..."
[]
For some reason it is not getting the content from the page. What am I doing wrong, and how can I fix it?
Answer (score: 1)
The items on that page are generated dynamically, so to capture them you need to use a browser simulator. Give this a try:
import time
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'
driver.get(Walmarthome)

# Scroll down a few times so the dynamically loaded category tiles are rendered.
page = driver.find_element_by_tag_name('body')
for i in range(3):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)

soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for item in soup.select(".TempoCategoryTileV2-tile"):
    title = item.select(".TempoCategoryTileV2-tile-overlay")[0]['aria-label']
    image = item.select("[itemprop='image']")[0]['src']
    print(title, image)
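A note on the waiting strategy: the fixed PAGE_DOWN/sleep loop above works, but you could also wait explicitly until the category tiles exist in the DOM. Below is a minimal sketch of that variant; it assumes a newer Selenium release where the find_element_by_* helpers are replaced by find_element(By, ...) and where WebDriverWait/expected_conditions are available, and it reuses the same CSS selectors as the code above.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'

driver = webdriver.Chrome()
try:
    driver.get(Walmarthome)
    # Wait up to 15 seconds for at least one category tile to be present,
    # instead of scrolling and sleeping for a fixed amount of time.
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".TempoCategoryTileV2-tile"))
    )
    soup = BeautifulSoup(driver.page_source, "lxml")
finally:
    driver.quit()

# Same extraction as above: category name from the overlay link, image URL from the <img>.
for item in soup.select(".TempoCategoryTileV2-tile"):
    title = item.select_one(".TempoCategoryTileV2-tile-overlay")["aria-label"]
    image = item.select_one("[itemprop='image']")["src"]
    print(title, image)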