Unable to get certain links from dynamic content

Date: 2018-12-22 12:24:34

Tags: python python-3.x selenium selenium-webdriver web-scraping

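I've written a script in Python with Selenium to scrape the links of the different properties listed in the right-hand area next to the map on the landing page.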

Link to the landing page

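When I click each block manually in Chrome, the link that opens in a new tab contains /for_sale/, whereas the links my script extracts contain /homedetails/.

How can I get the number of results (for example, 153 homes for sale) along with the right links to the properties?

My attempt so far: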

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.zillow.com/homes/33155_rb/"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)

itemcount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,"#map-result-count-message h2")))
print(itemcount.text)

for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".zsg-photo-card-overlay-link"))):
    print(item.get_attribute("href"))
driver.quit()

One of the current outputs:

https://www.zillow.com/homedetails/6860-SW-48th-Ter-Miami-FL-33155/44206318_zpid/

One of the expected outputs:

https://www.zillow.com/homes/for_sale/Miami-FL-33155/house_type/44184455_zpid/72458_rid/globalrelevanceex_sort/25.776783,-80.256072,25.695446,-80.364905_rect/12_zm/0_mmm/

3 answers:

Answer 0 (score: 2)

While analyzing the /homedetails/ and /for_sale/ links, I found that a /homedetails/ link generally contains a code like this:

  

44206318_zpid

This code acts as the unique identifier of the listing. I extract it and append it to:

  

https://www.zillow.com/homes/for_sale/

So the final link to the listing looks like this:

  

https://www.zillow.com/homes/for_sale/44206318_zpid

This is a valid link that points to the listing.
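
The rewriting step described above can be sketched as a small standalone helper that runs without a browser. This is only a sketch, assuming the identifier always appears as a path segment of the form `<digits>_zpid`; the names `to_for_sale_url`, `ZPID_PATTERN` and `FOR_SALE_PREFIX` are hypothetical and not part of Selenium or of any Zillow API.

```python
import re

# Assumption: the listing identifier is a path segment such as "44206318_zpid".
ZPID_PATTERN = re.compile(r"/(\d+_zpid)(?:/|$)")

FOR_SALE_PREFIX = "https://www.zillow.com/homes/for_sale/"


def to_for_sale_url(href):
    """Return the /for_sale/ form of a listing link, or None if no zpid is found."""
    match = ZPID_PATTERN.search(href)
    if match is None:
        return None
    return FOR_SALE_PREFIX + match.group(1)


homedetails = "https://www.zillow.com/homedetails/6860-SW-48th-Ter-Miami-FL-33155/44206318_zpid/"
print(to_for_sale_url(homedetails))
```

Compared with splitting the URL on `/` and taking the second-to-last piece, the pattern does not depend on the URL ending in a slash, and links without a zpid yield `None` instead of a wrong segment.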

Here is the final script:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.zillow.com/homes/33155_rb/"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)

itemcount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,"#map-result-count-message h2")))
print(itemcount.text)

for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".zsg-photo-card-overlay-link"))):
    href = item.get_attribute("href")
    # Keep only listing links; their second-to-last path segment is the zpid.
    if href and "zpid" in href:
        print("https://www.zillow.com/homes/for_sale/{}".format(href.split('/')[-2]))
driver.quit()

I hope this helps.
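
If the result count is needed as a number instead of as heading text, the string can be parsed after it has been read from the page. This is a minimal sketch, assuming the heading text looks like the example in the question ("153 homes for sale"); the helper name `parse_result_count` is hypothetical.

```python
import re


def parse_result_count(text):
    """Extract the first integer from heading text; None if there is none."""
    # Allow thousands separators such as "1,204".
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


print(parse_result_count("153 homes for sale"))
```

In the script this would be called on `itemcount.text`.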

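Answer 1 (score: 0)

You can iterate over the pagination div and keep a running counter of the number of homes shown on each page. To parse the HTML, this answer uses BeautifulSoup.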

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
import re, time

def home_num(_d: soup) -> int:
    # Count the listing anchors (hrefs starting with /homedetails/) on the parsed page.
    return len(_d.find_all('a', {'href': re.compile('^/homedetails/')}))

def unseen_links(d, _seen_links):
    # Anchors whose href contains /homes/for_sale/ and is not in _seen_links yet.
    return [i for i in d.find_elements(By.TAG_NAME, 'a')
            if isinstance(i.get_attribute("href"), str)
            and re.findall('/homes/for_sale/', i.get_attribute("href"))
            and i.get_attribute("href") not in _seen_links]

# Path to the local chromedriver binary; adjust for your machine.
d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://www.zillow.com/homes/33155_rb/')
homecount = home_num(soup(d.page_source, 'html.parser'))
_seen_links, _result_links = [], []
_start = unseen_links(d, _seen_links)
while _start:
    _new_start = _start[0]
    try:
        _new_start.send_keys('\n')  # follow the link
        time.sleep(5)               # give the next page time to load
        _start = unseen_links(d, _seen_links)
    except Exception:
        # The link could not be followed: mark it as seen and move on.
        _seen_links.append(_new_start.get_attribute('href'))
        _start = unseen_links(d, _seen_links)
    else:
        _seen_links.append(_new_start.get_attribute('href'))
        _result_links.append(_new_start.get_attribute('href'))
        homecount += home_num(soup(d.page_source, 'html.parser'))
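
The counting step can be tried offline, without a browser. The sketch below is not the code from this answer: it swaps BeautifulSoup for the standard library's `html.parser` so that it has no third-party dependencies, and it runs on made-up HTML fragments that stand in for `d.page_source`. The names `HomeLinkCounter` and `count_home_links` are hypothetical.

```python
from html.parser import HTMLParser


class HomeLinkCounter(HTMLParser):
    """Counts <a> tags whose href starts with /homedetails/."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        if href.startswith("/homedetails/"):
            self.count += 1


def count_home_links(html):
    parser = HomeLinkCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up fragments standing in for the page source of two result pages.
pages = [
    '<a href="/homedetails/a/1_zpid/">A</a><a href="/homes/for_sale/2_p/">next</a>',
    '<a href="/homedetails/b/2_zpid/">B</a><a href="/homedetails/c/3_zpid/">C</a>',
]
total = sum(count_home_links(page) for page in pages)
print(total)
```

Summing the per-page counts mirrors the running `homecount` counter described above.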

Answer 2 (score: 0)

  

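If you look at the images shown on the right side of the page, you'll see that they link to "homedetails", not "for_sale". Just try opening one of the links in a new tab and you'll observe that the actual link is a "homedetails" one.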
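
To check which kind of link a script has collected, the URL path can be inspected directly. This is a rough sketch, assuming the two URL shapes shown in the question; the helper name `link_kind` is hypothetical.

```python
from urllib.parse import urlparse


def link_kind(href):
    """Return 'homedetails', 'for_sale', or 'other' based on the URL path."""
    segments = [s for s in urlparse(href).path.split("/") if s]
    if segments[:1] == ["homedetails"]:
        return "homedetails"
    if segments[:2] == ["homes", "for_sale"]:
        return "for_sale"
    return "other"


scraped = "https://www.zillow.com/homedetails/6860-SW-48th-Ter-Miami-FL-33155/44206318_zpid/"
print(link_kind(scraped))
```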