Scraping images, URLs, and descriptions from a page

Time: 2015-06-27 12:25:51

Tags: python selenium web-scraping webdriver phantomjs

I am trying to get the image and video URLs from https://www.google.com/trends/home/all/IN

Here is the code:

driver = webdriver.PhantomJS('/usr/local/bin/phantomjs')
driver.set_window_size(1124, 850)
driver.get("https://www.google.com/trends/home/all/IN")
trend = {}
def getGooglerends():
    try:
    #Does this line makes any sense
        #element = WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_class_name('md-list-block ng-scope'))
        for s in driver.find_elements_by_class_name('md-list-block ng-scope'):
            print s.find_element_by_tag_name('img').get_attribute('src')
            print s.find_element_by_tag_name('img').get_attribute('alt')
            print s.find_elements_by_class_name('image-wrapper ng-scope').get_attribute('href')
    except:
        getNDTVTrends()
getGooglerends()

It raises:

WebDriverException: Message: {"errorMessage":"Compound class names not permitted","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"111","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:57213","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"class name\", \"sessionId\": \"648251c0-1cc7-11e5-bf1c-4ff79ddbdce4\", \"value\": \"md-list-block ng-scope\"}","url":"/elements","urlParsed":{"anchor":"","query":"","file":"elements","directory":"/","path":"/elements","relative":"/elements","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/elements","queryKey":{},"chunks":["elements"]},"urlOriginal":"/session/648251c0-1cc7-11e5-bf1c-4ff79ddbdce4/elements"}}
Screenshot: available via screen

Any suggestions on how to fix this error?

1 answer:

Answer 0 (score: 1)

  

Compound class names not permitted

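This essentially means that the class name you pass cannot contain spaces. find_elements_by_class_name accepts a single class name, and 'md-list-block ng-scope' is two class names separated by a space. You need to switch to another locator, such as a CSS selector or an XPath expression.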
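As a minimal sketch of the CSS route, the space-separated class string can be turned into a chained class selector. The helper names compound_class_to_css and find_by_classes are hypothetical, not part of Selenium; the sketch assumes driver is a Selenium WebDriver (or WebElement) and relies on "css selector" being the string value behind selenium's By.CSS_SELECTOR.

```python
def compound_class_to_css(class_attr):
    """Turn 'md-list-block ng-scope' into '.md-list-block.ng-scope'."""
    names = class_attr.split()
    if not names:
        raise ValueError("expected at least one class name")
    # Chaining .a.b matches elements that carry both classes, in any order.
    return "".join("." + name for name in names)


def find_by_classes(driver, class_attr):
    """Find elements that carry every class listed in class_attr.

    Assumption: `driver` exposes Selenium's find_elements(by, value);
    "css selector" is the locator strategy string of By.CSS_SELECTOR.
    """
    return driver.find_elements("css selector", compound_class_to_css(class_attr))
```

With this in place, the loop in the question could iterate over find_by_classes(driver, 'md-list-block ng-scope') instead of calling find_elements_by_class_name with a compound name.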

I am not sure exactly what you want to select, but as an example, the following XPath selects the list of items that has that class:

//div[@class="homepage-trending-stories generic-container ng-scope"]/md-list[@class="md-list-block ng-scope"]
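An exact @class="..." comparison only matches when the attribute string is identical, including the order of the class names. A possible variation, sketched here under the assumption that matching individual class tokens is what is wanted, builds an XPath that tests each class separately. The helper name xpath_with_classes is hypothetical, and whether the page still uses these tags and classes is not verified.

```python
def xpath_with_classes(tag, class_attr):
    """Build an XPath matching <tag> elements that carry every listed class."""
    tests = [
        # Pad with spaces so 'ng-scope' does not match e.g. 'ng-scope-x'.
        "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name
        for name in class_attr.split()
    ]
    if not tests:
        raise ValueError("expected at least one class name")
    return "//%s[%s]" % (tag, " and ".join(tests))
```

The resulting string can be passed to Selenium's XPath locator, for example driver.find_elements("xpath", xpath_with_classes("md-list", "md-list-block ng-scope")), where "xpath" is the string value of By.XPATH.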