How to scrape with Selenium WebDriver in Python

Time: 2018-05-17 19:51:57

Tags: python selenium-webdriver

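I am trying to scrape http://quotes.toscrape.com/. The page contains several boxes, each holding a quote, the name of the person quoted, and the tags for that quote. Here is what I have done so far with Selenium WebDriver in Python: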

from time import sleep
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/")
sleep(2)
all_boxes = driver.find_elements_by_xpath(r"//div[@class='quote']")
for each in all_boxes:
    print(each.find_element_by_xpath('//span').text)  # to print the quote

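What I am doing here is simple. I select all the boxes on the page, then iterate over them and try to print the quote in each box using the XPath I worked out from the HTML structure. But the output is not what I expected: even though I iterate over every box, the output only ever prints the quote from the first box.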

The output is:

 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.”

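I cannot find the problem with this specific approach. Please tell me what is wrong with it; I am already well aware of other techniques for scraping with Python's selenium or beautifulsoup libraries. I just want to know why the approach coded above does not work.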

2 Answers:

Answer 0 (score: 0)

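To scrape the website http://quotes.toscrape.com/ and extract the quotes, you have to construct a locator strategy that identifies all the quotes on the page, then induce WebDriverWait for all the elements to become visible and store them in a list. Finally, you can use the text property to extract all the text, as in the following solution: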

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = Options()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("http://quotes.toscrape.com/")
    all_boxes = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='quote']/span[@class='text']")))
    for each in all_boxes:
        print(each.text)
    
  • Console output:

    “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
    “It is our choices, Harry, that show what we truly are, far more than our abilities.”
    “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
    “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
    “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
    “Try not to become a man of success. Rather become a man of value.”
    “It is better to be hated for what you are than to be loved for what you are not.”
    “I have not failed. I've just found 10,000 ways that won't work.”
    “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
    “A day without sunshine is like, you know, night.”
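
The idea behind the locator above (one path that reaches the quote text directly, with no per-box lookup) can be tried without a browser. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for the browser DOM, and the PAGE markup, the "author" class and the variable names are hypothetical simplifications rather than the real page source.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each box holds a quote span and an author span.
PAGE = """<html><body>
<div class="quote"><span class="text">Quote one</span><span class="author">Author A</span></div>
<div class="quote"><span class="text">Quote two</span><span class="author">Author B</span></div>
</body></html>"""

root = ET.fromstring(PAGE)

# A single path selects only the quote-text spans inside each quote box,
# so the author spans are skipped and no nested lookup is needed.
texts = [span.text for span in root.findall(".//div[@class='quote']/span[@class='text']")]
print(texts)
```

ElementTree supports only a subset of XPath, so this illustrates the shape of the locator and is not a replacement for the Selenium call.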
    

Answer 1 (score: 0)

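Your XPath inside the loop is wrong. An XPath that begins with // is evaluated from the document root even when it is called on an element, so it always matches the first span on the page. What you should give is a path relative to the element you are currently iterating over, not a path into the whole document. So instead of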

each.find_element_by_xpath('//span').text

use

each.find_element_by_xpath('./span').text
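
The difference between a document-wide search and a search relative to the current box can be reproduced without a browser. This is a rough sketch under stated assumptions: xml.etree.ElementTree stands in for the browser DOM, the PAGE markup is a hypothetical simplification of the real page, and since ElementTree rejects absolute paths on elements, the document-wide search is modelled by searching from the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the page structure described in the question.
PAGE = """<html><body>
<div class="quote"><span class="text">Quote one</span><span>by Author A</span></div>
<div class="quote"><span class="text">Quote two</span><span>by Author B</span></div>
<div class="quote"><span class="text">Quote three</span><span>by Author C</span></div>
</body></html>"""

root = ET.fromstring(PAGE)
boxes = root.findall(".//div[@class='quote']")

# Document-wide search: the current box is ignored, so the first span
# in the document is returned on every iteration.
from_document = [root.find(".//span").text for _ in boxes]

# Relative search: "./span" starts at the current box and returns
# that box's own first child span.
from_box = [box.find("./span").text for box in boxes]

print(from_document)
print(from_box)
```

The first list repeats the first quote once per box, which matches the symptom in the question; the second list contains one quote per box.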