Web scraping with Beautiful Soup in Python not working

Asked: 2017-11-06 00:27:42

Tags: python web-scraping

I am trying to scrape some data from the Walmart website for research:

https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo

I want to scrape all of the product categories. Each product category has this container HTML:

  <div class="TempoCategoryTileV2-tile"><img alt="" aria-hidden="true" tabindex="-1" itemprop="image" src="//i5.walmartimages.com/dfw/4ff9c6c9-deda/k2-_c3162a27-dbb6-46df-8b9f-b5b52ea657b2.v1.jpg?odnWidth=168&amp;odnHeight=210&amp;odnBg=ffffff" class="TempoCategoryTileV2-tile-img display-block">
<div class="TempoCategoryTileV2-tile-content-one text-center">
    <div class="TempoCategoryTileV2-tile-linkText">
        <div style="overflow: hidden;">
            <div>Toyland</div>
        </div>
    </div>
</div><a class="TempoCategoryTileV2-tile-overlay" id="HomePage-contentZone12-FeaturedCategoriesCuratedV2-tileLink-1" aria-label="Toyland" href="/cp/toys/4171?povid=14503+%257C+contentZone12+%257C+2017-11-01+%257C+1+%257C+HP+FC+Toys" data-uid="zir3SFhh" tabindex="" data-tl-id="HomePage-contentZone12-FeaturedCategoriesCuratedV2-categoryTile-1-link" style="background-image: url(&quot;about:blank&quot;);"></a></div>

What I want to get is the text and the image of each category, so I used this Python script:

 import time
 import requests
 from bs4 import BeautifulSoup as soup

 Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'
 uClient = ''
 while uClient == '':
     try:
         start = time.time()
         uClient = requests.get(Walmarthome)

         print("Relax we are getting the data...")

     except requests.exceptions.ConnectionError:
         print("Connection refused by the server..")
         print("Let me sleep for 7 seconds")
         print("ZZzzzz...")
         time.sleep(7)
         print("Was a nice sleep, now let me continue...")
         continue

 page_html = uClient.content
 # close client
 uClient.close()
 page_soup = soup(page_html, "html.parser")

 productcategories = page_soup.find_all("div", {"class": "TempoCategoryTileV2 Grid-col u-size-1-2 u-size-1-3-s u-size-1-4-m u-size-1-5-l u-size-1-6-xl"})
 print(productcategories)
 for categorycontainer in productcategories:
     categorycard = categorycontainer.find("div", {"class": "TempoCategoryTileV2-tile-linkText"})
     if categorycard is not None:
         print("getting link")
         print(categorycard)

But when I run it, all I get is this:

 "Relax we are getting the data..." 
 []

For some reason it is not getting the content from the page. What am I doing wrong, and how can I fix it?
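For what it's worth, the extraction logic itself works when the markup is present: run against the static container snippet shown above, BeautifulSoup pulls out the category text and image without trouble. Below is a minimal sketch using a trimmed copy of that snippet (the abbreviated image URL is an assumption for brevity); it suggests the problem is that this markup never appears in the raw `requests` response at all:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the container HTML from the question, as a static string
snippet = '''
<div class="TempoCategoryTileV2-tile"><img alt="" itemprop="image"
  src="//i5.walmartimages.com/dfw/4ff9c6c9-deda/k2.jpg"
  class="TempoCategoryTileV2-tile-img display-block">
<div class="TempoCategoryTileV2-tile-content-one text-center">
  <div class="TempoCategoryTileV2-tile-linkText">
    <div style="overflow: hidden;"><div>Toyland</div></div>
  </div>
</div><a class="TempoCategoryTileV2-tile-overlay" aria-label="Toyland"
  href="/cp/toys/4171"></a></div>
'''

tile = BeautifulSoup(snippet, "html.parser").select_one(".TempoCategoryTileV2-tile")
title = tile.select_one(".TempoCategoryTileV2-tile-overlay")["aria-label"]
image = tile.select_one("[itemprop='image']")["src"]
print(title)   # Toyland
print(image)
```

Since parsing the static HTML succeeds, the empty `[]` result points at the server response, not at the `find_all` call.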

1 Answer:

Answer 0 (score: 1)

The items on that page are generated dynamically, so you need to use a browser simulator to capture them. Try this:

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'
driver.get(Walmarthome)

# Scroll down a few times so the lazily loaded tiles are rendered
page = driver.find_element_by_tag_name('body')
for i in range(3):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)

# Hand the fully rendered HTML to BeautifulSoup, then close the browser
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for item in soup.select(".TempoCategoryTileV2-tile"):
    title = item.select(".TempoCategoryTileV2-tile-overlay")[0]['aria-label']
    image = item.select("[itemprop='image']")[0]['src']
    print(title, image)
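Since the data is for research, you will probably want to persist the scraped pairs rather than just print them. A small follow-up sketch using the stdlib `csv` module (the filename, and collecting the pairs into a list first, are my assumptions, not part of the answer above):

```python
import csv

# Assume the (title, image) pairs from the loop above were collected
# into a list instead of printed, e.g.:
categories = [("Toyland", "//i5.walmartimages.com/dfw/4ff9c6c9-deda/k2.jpg")]

# Write one row per category, with a header row
with open("walmart_categories.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "image"])
    writer.writerows(categories)
```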