Question

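Good afternoon.

I am currently trying to parse this website: http://uk.easyroommate.com/results-room/loc/981238/pag/1.

I would like to get a list of the URLs of every ad. However, this list is generated by JavaScript. I can see the URLs perfectly well with Firebug in Firefox, but I have not found any way to get them with Python. I think it is doable, but I don't know how.

EDIT: Obviously I have tried modules like BeautifulSoup, but since the page is generated by JavaScript, it is completely useless here.

Thanks in advance for your help.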

Answer 1

The list of ads is generated by JavaScript. In the HTML that BeautifulSoup receives, the list is just an empty container, for example:

<ul class="search-results" data-bind="template: { name: 'room-template', foreach: $root.resultsViewModel.Results, as: 'resultItem' }"></ul>

I suggest you look at: "Getting html source when some html is generated by javascript" and "Python Scraping JavaScript using Selenium and Beautiful Soup".
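To see why a static parser comes up empty, here is a minimal sketch that feeds the container shown above to a parser and collects every link it finds. It uses the standard library's `html.parser` instead of BeautifulSoup only so that it runs without extra packages; `LinkCollector` and `collect_links` are names made up for this illustration.

```python
from html.parser import HTMLParser

# The markup as served, before any JavaScript has run (copied from above).
STATIC_HTML = """<ul class="search-results" data-bind="template: { name: 'room-template', foreach: $root.resultsViewModel.Results, as: 'resultItem' }"></ul>"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it encounters (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def collect_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# The <ul> has no children yet, so there is nothing to extract.
print(collect_links(STATIC_HTML))  # prints []
```

The ads only appear inside that `<ul>` after the browser has executed the page's scripts, which is why the rendered source has to come from something that actually runs JavaScript.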

Answer 2

Thanks for the lead. Here is the solution; I hope it helps someone someday:

from selenium import webdriver
from bs4 import BeautifulSoup

# Let a real browser run the page's JavaScript, then grab the rendered HTML.
browser = webdriver.Firefox()
browser.get('http://uk.easyroommate.com/results-room/loc/981238/pag/1')
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
print(soup.prettify())
# You can now see the HTML generated by the JavaScript code and
# extract from it as usual with BeautifulSoup.

for el in soup.find_all('div', class_="listing-meta listing-meta--small"):
    link = el.find('a')
    if link is not None:  # skip listing blocks that contain no link
        print(link.get('href'))

In my case I only wanted to extract those links, but once you have the page source from Selenium, using BeautifulSoup to get every item you want is a piece of cake.
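One caveat: `page_source` is read right after `get()` returns, so on a slow connection the ads may not have been rendered yet. A possible variant, sketched here as an assumption and not as part of the original solution, waits explicitly until at least one ad link is present. The function name `fetch_rendered_html`, the selector `div.listing-meta a` (derived from the class used above) and the 10-second timeout are all illustrative choices.

```python
def fetch_rendered_html(url, css_selector="div.listing-meta a", timeout=10):
    """Load url in Firefox, wait until css_selector matches, return the HTML.

    Raises selenium.common.exceptions.TimeoutException if nothing matches
    the selector within `timeout` seconds.
    """
    # Imported here so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # Block until the JavaScript has inserted at least one matching element.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        # Close the browser even if the wait times out.
        browser.quit()
```

Calling `fetch_rendered_html('http://uk.easyroommate.com/results-room/loc/981238/pag/1')` would then return a string that can be passed to `BeautifulSoup(..., 'html.parser')` exactly as above.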
