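I am scraping real estate data. On JavaScript-generated sites, Selenium does a great job: you find the tag that contains the relevant information and iterate over all matching tags with driver.find_elements_by...
On this site, however, the listings are generated by AngularJS. I tried the same approach: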
for article in driver.find_elements_by_css_selector("div.property.ng-scope"):
    pass  # do something with each article
I found that I had to make my webdriver (PhantomJS) click the link that leads to each individual listing page:
linkbase = article.find_element_by_css_selector("div.info.clear.ng-scope")
link = linkbase.find_element_by_tag_name('a')
link.click()
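The webdriver then simply navigates to that page, and I can get all the information I want for one listing.
As soon as that iteration of the loop finishes, I get the following error: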
> Message: {"errorMessage":"Element does not exist in cache","request":{"headers":
{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","
Content-Length":"142","Content-Type":"application/json;charset=UTF-8","Host":"12
7.0.0.1:56577","User-Agent":"Python-urllib/3.4"},"httpVersion":"1.1","method":"P
OST","post":"{\"sessionId\": \"f9ec2c10-dfd9-11e5-9d4c-3bbe8f5bf7c0\", \"using\"
: \"css selector\", \"id\": \":wdc:1456856343349\", \"value\": \"div.info.clear.
ng-scope\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"elemen
t","directory":"/","path":"/element","relative":"/element","port":"","host":"","
password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/ele
ment","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/f9ec2c10-dfd9-
11e5-9d4c-3bbe8f5bf7c0/element/:wdc:1456856343349/element"}}
The element on the page that contains the link is:
<a ng-href="/detail/prodej/dum/rodinny/jemnice-jemnice-/3800125532" ng-click="beforeOpen(i.iterator, i.regionTip)" class="title" href="/detail/prodej/dum/rodinny/jemnice-jemnice-/3800125532">
<span class="name ng-binding"> ... </a>
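This is just the title text of each listing. I did set a user agent, following this answer, even though it does not show up in the error. I also wait for the surrounding element to load first: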
wait = WebDriverWait(driver, getSearchResults_CZ.waiting)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.content")))
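What I want is to parse all of these property elements, save their links to a list, and then iterate over that list, opening each link with driver.get(). I know that clicking a link changes the driver's URL, but I assumed that once the list of articles had been built with find_elements_by, it would serve as a stable reference point. Accessing the link by searching for the 'a' tag and calling get_attribute('href') does not work in this case with the AngularJS framework. What am I missing?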
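Edit: As noted, get_attribute without .click() is the right approach. My original mistake was in the CSS selector: I had been using "div[class^='property']" and got a completely different link. It must have matched another element that I had not noticed before.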
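To illustrate why the two selectors can pick up different elements, here is a minimal sketch in plain Python that imitates how each one tests a class attribute. The helper names and the sample class strings other than "property ng-scope" are hypothetical and are not taken from the site; only the matching rules (whitespace-separated class token versus prefix of the raw attribute string) follow standard CSS selector semantics.

```python
def matches_class_selector(class_attr, token):
    """Imitate `div.token`: token must be one of the space-separated classes."""
    return token in class_attr.split()


def matches_prefix_selector(class_attr, prefix):
    """Imitate `div[class^='prefix']`: the raw attribute string must start with prefix."""
    return class_attr.startswith(prefix)


# Hypothetical class attributes; only the first resembles the listing markup.
samples = ["property ng-scope", "property-list", "ng-scope property"]

for class_attr in samples:
    print(
        repr(class_attr),
        "div.property:", matches_class_selector(class_attr, "property"),
        "div[class^='property']:", matches_prefix_selector(class_attr, "property"),
    )
```

Under these assumptions, a wrapper whose class merely begins with "property" would be caught by the prefix selector but not by div.property, which may explain the unexpected link.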
Answer 0 (score: 1)
Wait until at least one "property" element is visible, then grab the links:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://www.sreality.cz/hledani/prodej/domy?region=jemnice")
driver.maximize_window()
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "property")))
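# find_elements(By.CSS_SELECTOR, ...) also works on Selenium releases that dropped find_elements_by_css_selector
links = [link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, "div.property div.info a")]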
print(links)
driver.close()
Works for me.
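The "Element does not exist in cache" error in the question is consistent with reusing element references after the browser has navigated away. A rough sketch of the collect-first, visit-later pattern might look like the following. The function names collect_links and visit_all and the scrape callback are hypothetical; the driver is assumed to be any Selenium WebDriver, and the locator string "css selector" is the value of By.CSS_SELECTOR (it also appears in the error log above).

```python
def collect_links(driver, selector="div.property div.info a"):
    """Return the href of every matching anchor as plain strings, without duplicates."""
    anchors = driver.find_elements("css selector", selector)
    links = []
    seen = set()
    for anchor in anchors:
        href = anchor.get_attribute("href")
        # Skip anchors without an href and keep only the first occurrence of each URL.
        if href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


def visit_all(driver, links, scrape):
    """Open each URL with driver.get() and collect whatever scrape(driver) returns."""
    results = []
    for url in links:
        driver.get(url)  # navigation invalidates old elements, but strings stay valid
        results.append(scrape(driver))
    return results


# Possible usage with a real driver (not executed here):
#   links = collect_links(driver)
#   titles = visit_all(driver, links, lambda d: d.title)
```

Because only strings are kept between page loads, nothing in the second loop depends on elements from the search results page.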