Question

我一直试图从这个网址中抓取数据：http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa&txt_statelist=&txt_state=&ERR_LS_20161018_041816_21233=txt_statelist%7CLocation%7C20%7C0%7C%7C0一天中的大部分时间 - 并且知道随着时间的推移，我的效率非常低。我刚刚学会了抓住普通的html网站，似乎已经掌握了它。 javascript驱动的一直被证明是痛苦的。

到目前为止我一直在研究的刮刀 - 在接近问题的许多角度之后产生了相同的结果。以下是我正在使用的代码：

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait

PHANTOMJS_PATH = './phantomjs.exe'

#Using PhantomJS to navigate the url
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa&txt_statelist=&txt_state=&ERR_LS_20161018_041816_21233=txt_statelist%7CLocation%7C20%7C0%7C%7C0')

wait = WebDriverWait(browser, 15)
# let's parse our html
soup = BeautifulSoup(browser.page_source, "html5lib")

# get all the games
test = soup.find_all('tr')

print test

我最大的问题是我无法获得我正在寻找的细节。在下图中： Highlighted field

我无法获取与该特定名称相关的网址。获取URL后，我想进一步导航到用户以获取更多详细信息。

所以我的问题如下：

是否有更有效的方式以编程方式返回您要查找的数据（给定时间有限）。
有没有更好的方法来查看您在抓取时如何浏览javascript生成的网站？
如果我需要更清晰，请告诉我。

谢谢！

第2部分：

我采取了另一种方法，并遇到了另一个问题。

我尝试使用以下内容获取上面的代码：

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
from bs4 import BeautifulSoup

browser = webdriver.Chrome()  
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup(html_source,'html.parser')  
comments = soup.findAll('a')  
print comments

在我正在打印的“评论”列表中，我正在寻找的特定元素没有出现。即。

<a href="/members/?id=35097829" id="MiniProfileLink_35097829" onmouseover="MiniProfileLink_OnMouseOver(35097829);" onmouseout="HideMiniProfile();" target="_top">Namir Abraham</a>

然后我试图使用selenium功能：

来自selenium import webdriver
来自selenium.common.exceptions import NoSuchElementException
来自selenium.webdriver.common.keys导入密钥
来自bs4 import BeautifulSoup

browser = webdriver.Chrome('C:/Users/rschilder/Desktop/Finance24 Scrape/Accountant_scraper/chromedriver.exe')
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')  
browser.implicitly_wait(30)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#browser.quit()

print html

我对此的挑战是：

我不太确定如何使用selenium get功能搜索和获取特定元素（它不像美丽的汤那么直观）
即使使用selenium导航 - 我正在寻找的元素（如上所述）仍然没有出现在输出中？

无法使用python

0 个答案: