Unable to scrape Google AdSense

Time: 2015-07-15 15:15:40

Tags: python selenium-webdriver web-scraping google-adwords

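I am trying to scrape a website and want to get the URLs and images from its Google AdSense ads. However, I don't seem to get any details of the AdSense content.

What I want:
If we search for "refrigerator" on Google, we get some ads, and those are what I need to fetch. The same goes for blogs and websites that display Google ads, as in the screenshots.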

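When I inspect the page, I can find the relevant div and URL, but when I open that URL I only get static HTML.

This is the markup I need to fetch:

[Screenshot from Google search]

Here is the script I wrote with Selenium and Python.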

from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.compiletimeerror.com/"

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get(url) # load page
    delay = 10 # seconds
try:
    WebDriverWait(browser, delay).until(EC.presence_of_element_located(browser.find_element_by_xpath("(//div[@class='pla-unit'])[0]")))
    print "Page is ready!"
    Element=browser.find_element(By.ID,value="google_image_div")
    print Element
    print Element.text
except TimeoutException:
    print "Loading took too much time!"

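But I still can't get the data. Any reference or hint would be appreciated.

2 answers:

Answer 0 (score: 1)

You need to first select the frame that contains the element you want to work with.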

select_frame("id=google_ads_frame1");

Note: I'm not sure of the Python syntax, but it should be similar to this.
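To work out which frame to select, it can help to list the iframe elements in the page source first. The following is only a sketch using Python's standard html.parser module. The names IframeLister and list_iframes and the sample HTML are made up for illustration; in practice the input would be something like browser.page_source, assuming the ad frame has already been injected into the page by then.

```python
from html.parser import HTMLParser


class IframeLister(HTMLParser):
    """Collect the id and src attributes of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.frames = []  # (id, src) tuples; a missing attribute is None

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            attributes = dict(attrs)
            self.frames.append((attributes.get("id"), attributes.get("src")))


def list_iframes(html):
    """Return (id, src) for each iframe found in the given HTML string."""
    parser = IframeLister()
    parser.feed(html)
    return parser.frames


# Made-up sample markup; a real page would come from browser.page_source.
SAMPLE_HTML = """
<html><body>
  <div id="content">Blog post</div>
  <iframe id="google_ads_frame1" src="https://example.com/ad"></iframe>
  <iframe src="https://example.com/other"></iframe>
</body></html>
"""

frames = list_iframes(SAMPLE_HTML)
for frame_id, src in frames:
    print(frame_id, src)
```

Any id printed here is a candidate to pass to the frame-selection call.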

Answer 1 (score: 1)

Use Selenium's switch_to.frame method to point the browser at the iframe in the HTML before you select your element variable (untested):

from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.compiletimeerror.com/"
delay = 10 # seconds

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get(url) # load page
    try:
        # wait for the ad iframe to appear; the condition takes a (By, value) locator tuple
        frame = WebDriverWait(browser, delay).until(
            EC.presence_of_element_located((By.ID, "google_ads_frame1")))
        print("Page is ready!")
        # element lookups after this call are resolved inside the iframe
        browser.switch_to.frame(frame)
        element = browser.find_element(By.ID, "google_image_div")
        print(element)
        print(element.text)
    except TimeoutException:
        print("Loading took too much time!")

http://elementalselenium.com/tips/3-work-with-frames

A note on Python style best practices: use lowercase names when declaring local variables (element rather than Element).
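Once inside the frame, the URLs and images the question asks for can be read from its anchor and image elements. The helper below is a sketch and has not been run against a live AdSense frame. collect_ad_urls is a hypothetical name, the default frame id is the one used in the answers, and the locator strings "id" and "tag name" are the values held by Selenium's By.ID and By.TAG_NAME constants. It accepts any WebDriver-like object, so it can be called with an open Selenium browser.

```python
def collect_ad_urls(driver, frame_id="google_ads_frame1"):
    """Return (link_urls, image_urls) found inside the given iframe.

    `driver` is expected to behave like a Selenium WebDriver.
    """
    # "id" and "tag name" are the string values of By.ID and By.TAG_NAME.
    frame = driver.find_element("id", frame_id)
    driver.switch_to.frame(frame)
    try:
        links = [a.get_attribute("href")
                 for a in driver.find_elements("tag name", "a")]
        images = [img.get_attribute("src")
                  for img in driver.find_elements("tag name", "img")]
    finally:
        # Always return to the top-level document, even if a lookup fails.
        driver.switch_to.default_content()
    # Drop elements that carry no href/src attribute.
    return [u for u in links if u], [u for u in images if u]
```

Switching back in the finally block keeps later lookups on the main page working. Ad markup may sit in further nested frames, which this sketch does not handle.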