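My goal is to extract page content from popup windows. Until now I have simply used PhantomJS for dynamic scraping, which works for most websites, but a few websites use AngularJS for their popups. In those cases dynamic scraping does not work, and in all of them the tag that opens the popup carries an ng-click attribute.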
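I first tried to find the tags with ng-click and then, after some filtering, ended up with the tags containing ng-label. I use that tag and its ng-click value to find the element with a CSS selector, which is the best option for me because I need to automate this at large scale.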
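As an illustration of the selector step described above, here is a minimal sketch of how a tag name and an ng-click value could be combined into a CSS attribute selector. The helper name `ng_click_selector` is hypothetical and is not part of my scraper; it only shows the general shape of such a selector.

```python
def ng_click_selector(tag_name, ng_click_value):
    """Build a CSS attribute selector such as a[ng-click="openLogin()"]."""
    # Escape backslashes and double quotes so the value stays a valid CSS string.
    escaped = ng_click_value.replace("\\", "\\\\").replace('"', '\\"')
    return '{0}[ng-click="{1}"]'.format(tag_name, escaped)
```

A selector built this way matches on the attribute itself, so it does not depend on the element's class names.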
My code works fine with Firefox but not with PhantomJS. It cannot find the CSS path.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# These are methods of a class whose definition is not shown; `cap` is defined elsewhere.
def __init__(self):
    # self.driver = webdriver.Firefox()
    self.driver = webdriver.PhantomJS(desired_capabilities=cap)
    # self.driver.implicitly_wait(90)
    # self.driver.set_page_load_timeout(90)
    # self.driver.set_script_timeout(90)
    self.driver.set_window_size(1120, 1000)

def get_data(self, link):
    soup = None
    tag_list = []
    try:
        self.driver.get(link)
        soup = BeautifulSoup(self.driver.page_source)
        # Collect every element that carries an ng-click attribute.
        elements = self.driver.find_elements_by_xpath("//*[(@ng-click)]")
        for element in elements:
            attr_value = element.get_attribute("ng-click")
            tag_name = element.tag_name
            tags = soup.find_all(tag_name, {'ng-click': attr_value})
            for tag in tags:
                try:
                    text = tag.text.lower()
                    # Keep only tags whose text looks like a login link.
                    if any(word in text for word in ("login", "log in", "signin", "sign in")):
                        # Build a selector of the form tag.class1.class2 from the class attribute.
                        comp_class_ = element.get_attribute('class')
                        class_ = comp_class_.replace(" ", ".")
                        css_path = str(tag_name + "." + class_)
                        target = WebDriverWait(self.driver, 10).until(
                            EC.presence_of_element_located((By.CSS_SELECTOR, css_path)))
                        target.click()
                        # self.driver.find_element_by_css_selector(css_path).click()
                        tag_list.append(tag)
                except Exception as e:
                    # Failures for a single tag are silently ignored.
                    pass
        print self.driver.page_source.encode('utf-8')
    except Exception as e:
        print "Error while scraping {}".format(link)
    return soup
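One detail of the class-based selector may be worth checking separately: replacing each space with a dot produces an invalid selector when the class attribute is empty or contains leading, trailing, or repeated spaces (for example `a.btn..primary`). Whether this explains the PhantomJS failure is only a guess. The sketch below uses two hypothetical helpers, `looks_like_login` and `class_selector`, that mirror the keyword filter and the selector construction from the code above while normalizing whitespace.

```python
LOGIN_KEYWORDS = ("login", "log in", "signin", "sign in")


def looks_like_login(text):
    """Return True if the text contains one of the login keywords."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in LOGIN_KEYWORDS)


def class_selector(tag_name, class_attr):
    """Build tag.class1.class2 from a raw class attribute string."""
    # split() with no argument drops empty pieces caused by extra whitespace.
    classes = (class_attr or "").split()
    return ".".join([tag_name] + classes)
```

For instance, `class_selector("a", " btn  btn-primary ")` yields `a.btn.btn-primary`, and an element without classes yields just the tag name instead of a selector ending in a dot.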