Strange bug in Beautiful Soup 4 with for-loop iteration

Asked: 2016-12-15 12:30:31

Tags: python arrays list for-loop beautifulsoup

I am trying to scrape a website that loads its data via AJAX. I want to do this for a series of URLs that I have put in a list, iterating over them with a for loop. Here is my code:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import pdb

listUrls = [
    'https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya',
    'https://www.flipkart.com/samsung-galaxy-on8-gold-16-gb/p/itmemvarkqg5dyay',
]
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)

for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    labels = soup.findAll('li', {'class':"_1KuY3T row"})
    print labels

When I run this code, I get results for the first URL, but the second prints an empty list. I tried printing the soup for both URLs, and that works. The problem only occurs when I print the labels: the labels for the first URL are printed, but the list for the second one is empty.

[<truncated>...Formats</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">MP3</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Capacity</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">3300 mAh</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Type</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Li-Ion</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Width</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">75 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Height</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">151.7 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>]
[]

Image: Result when I print labels in a loop

I used the interactive debugging module pdb to dig further, and something strange happened: when I added a pdb.set_trace() call before printing the labels and stepped through the loop, the list of labels for the second URL was printed as well.

for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser") 
    labels = soup.findAll('li', {'class':"_1KuY3T row"})
    pdb.set_trace()
    print labels

...

[<truncated>..."vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>]
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(12)<module>()
-> for url in listUrls:
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(13)<module>()
-> browser.get(url)
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(15)<module>()
-> soup = BeautifulSoup(browser.page_source, "html.parser") #put all html in soup
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(16)<module>()
-> labels = soup.findAll('li', {'class':"_1KuY3T row"})
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(17)<module>()
-> pdb.set_trace()
(Pdb) 
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(18)<module>()
-> print labels
(Pdb) n
[<li class="_1KuY3T row"><div class="vmXPri col col-3-12">Sales Package</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Handset, Adapter, Earphone, User Manual</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Model Number</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">J710FZDGINS</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Model Name</...<truncated>]

Image: Result when I run the code with stack trace

I also checked each URL from the loop individually, and each one works fine on its own. I am new to programming and I am at a loss here; I would greatly appreciate any insight into why this happens. Thanks!

1 Answer:

Answer 0 (score: 0)

The fact that it works while you are debugging is a telltale sign that this is a timing issue. When you step through the code in the debugger, you are essentially giving the page more time to load, which is why the labels print correctly.
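Stepping through in pdb only adds an accidental delay; a robust solution polls for the content instead of sleeping a fixed amount of time. As a stdlib-only illustration of the polling pattern that Selenium's WebDriverWait implements (a simplified sketch, not Selenium's actual code):

```python
import time

def wait_until(condition, timeout=20, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Simplified sketch of the wait-and-poll pattern; Selenium's
    WebDriverWait does essentially this against the live page.
    """
    end = time.monotonic() + timeout
    while time.monotonic() < end:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %s seconds" % timeout)

# Toy usage: the condition becomes truthy on the third poll
calls = []
def ready():
    calls.append(1)
    return len(calls) >= 3

print(wait_until(ready, timeout=5, poll=0.01))  # True
```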

What you need to do is make this reliable and predictable by adding an Explicit Wait: wait for at least one label to be present on the page before parsing it:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...

for url in listUrls:
    browser.get(url)

    # wait for labels to be present/rendered
    wait = WebDriverWait(browser, 20)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li._1KuY3T.row")))

    soup = BeautifulSoup(browser.page_source, "html.parser")
    labels = soup.select("li._1KuY3T.row")
    print(labels)
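Once the rows have rendered, you may also want the spec name/value pairs rather than raw tags. Here is a sketch of that extraction with BeautifulSoup, run against a static snippet copied from the question's output (the Flipkart class names `_1KuY3T`, `vmXPri`, and `sNqDog` are obfuscated and may change at any time; in the real script the HTML would come from `browser.page_source` after the wait):

```python
from bs4 import BeautifulSoup

# Static snippet mirroring two of the rows printed in the question.
html = """
<ul>
  <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Capacity</div>
    <ul class="_3dG3ix col col-9-12"><li class="sNqDog">3300 mAh</li></ul></li>
  <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Type</div>
    <ul class="_3dG3ix col col-9-12"><li class="sNqDog">Li-Ion</li></ul></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
specs = {}
for row in soup.select("li._1KuY3T.row"):
    # Each row holds the spec name in a div and its value in a nested li
    name = row.select_one("div.vmXPri").get_text(strip=True)
    value = row.select_one("li.sNqDog").get_text(strip=True)
    specs[name] = value

print(specs)  # {'Battery Capacity': '3300 mAh', 'Battery Type': 'Li-Ion'}
```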