Scraping specific web data

Time: 2016-10-30 11:16:42

Tags: python python-2.7

Here is my code:

import urllib2

from bs4 import BeautifulSoup

url = "https://www.seek.co.nz/jobs/in-new-zealand/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&companyID=&advertiserID=&advertiserGroup=&keywords=&page=3&displaySuburb=&seoSuburb=&where=All+New+Zealand&whereId=3001&whereIsDirty=false&isAreaUnspecified=false&location=3001&area=&nation=3001&sortMode=ListedDate&searchFrom=quick&searchType="
response = urllib2.urlopen(url)
html = response.read()

soup = BeautifulSoup(html, "lxml")
#print soup.prettify()
# Calling the soup object directly is shorthand for soup.find_all(...).
job_title = soup("a", {"class": "job-title"})

print job_title

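I want to get all the job titles from the website.

When I run the code, the result is an empty list, []. I have tried every form of find_all() I could think of, but none of them worked.

I am sure the website contains the information I need.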


2 answers:

Answer 0 (score: 0)

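Try printing html to see whether it contains any tags with the class job-title. I tried this and did not find any. As Martijn Pieters suggested in a comment, the browser developer tools show the DOM after JavaScript has built it dynamically, which is why the elements are visible there but missing from the raw response.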
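The check described above can be sketched roughly as follows. This is only an illustration: it uses the standard-library HTML parser so it runs without extra packages, the names ClassCounter and count_tags are made up for this example, and the two HTML strings are hypothetical stand-ins for the raw server response and the browser-rendered DOM, not actual content from the site.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        HTMLParser.__init__(self)
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_tags(html, tag="a", css_class="job-title"):
    """Return how many <tag class="... css_class ..."> elements html contains."""
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins: an empty template as the server might send it,
# and the same container after JavaScript has filled it in.
static_html = '<div id="results" data-bind="foreach: jobs"></div>'
rendered_html = '<div id="results"><a class="job-title" href="/job/1">Trade Assistant</a></div>'

print(count_tags(static_html))
print(count_tags(rendered_html))
```

If the count is zero for the html returned by urllib2, no find_all() variant can succeed, and the page has to be rendered (or its underlying data source queried) first.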

Answer 1 (score: 0)

Try this:

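import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
# Note: QtWebKit was dropped from Qt in 5.6, so this import may be
# unavailable in newer PyQt5 builds.
from PyQt5.QtWebKitWidgets import QWebPage
from bs4 import BeautifulSoup

class Render(QWebPage):
  """Load a URL in a QWebPage and keep its main frame once loading has finished."""

  # One QApplication is shared by all Render instances.
  app = QApplication(sys.argv)

  def __init__(self, url):
    QWebPage.__init__(self)
    # Run _loadFinished when the page reports that it has loaded.
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    # Blocks here until _loadFinished calls quit().
    self.app.exec_()

  def _loadFinished(self, result):
    # Keep a reference to the frame so the caller can read the rendered DOM.
    self.frame = self.mainFrame()
    self.app.quit()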

url = 'https://www.seek.co.nz/jobs/in-new-zealand/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&companyID=&advertiserID=&advertiserGroup=&keywords=&page=3&displaySuburb=&seoSuburb=&where=All+New+Zealand&whereId=3001&whereIsDirty=false&isAreaUnspecified=false&location=3001&area=&nation=3001&sortMode=ListedDate&searchFrom=quick&searchType='  
r = Render(url)  
html = r.frame.toHtml() 
soup = BeautifulSoup(html, "lxml")   
job_title = soup.find("a", {"class": "job-title"})

print(job_title)

Output:

<a class="job-title" data-bind="storeJobInformation: { currentPage: $root.pagination.currentPage, jobsCount: $root.jobs.jobs().length }, 
                                        html: name, 
                                        attr: { 
                                            target: !$root.onsiteSearch() ? '_self' : '_blank', 
                                            href: SEEK.searchResultsPage.jobDetailsActionUrl + '/' + id + '?pos=' + position + '&amp;type=' + adType() + '&amp;engineConfig=' + $root.jobs.engineConfig() + '&amp;userqueryid=' + $root.jobs.userQueryId() + '&amp;tier=' + (locationMatch === 'Exact' ? 'tier1' : (locationMatch === 'Nearby' ? 'tier2' : (locationMatch === 'Area' ? 'tier3' : 'no_tier'))) + '&amp;whereid=' + ($root.jobs.location().whereId || '')
                                        },
                                        click: $root.jobs.handleJoraAdClick" href="/job/32120592?pos=1&amp;type=promoted&amp;engineConfig=&amp;userqueryid=123949496807341226&amp;tier=no_tier&amp;whereid=3001" target="_self">Trade Assistant</a>
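
The snippet above prints only the first match, because find() returns a single element. Since the question asks for all job titles, a possible follow-up is to use find_all() and keep just the link text. The sketch below assumes the rendered markup has the shape shown in the output; rendered_html is a hypothetical sample standing in for r.frame.toHtml(), and "html.parser" is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample of rendered markup; in the answer's code this would
# be the string returned by r.frame.toHtml().
rendered_html = """
<article><a class="job-title" href="/job/1">Trade Assistant</a></article>
<article><a class="job-title" href="/job/2"> Site Manager </a></article>
<article><a class="company" href="/c/3">Not a job title</a></article>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# find_all() returns every matching tag; get_text(strip=True) drops the
# markup and surrounding whitespace, leaving only the visible title.
titles = [a.get_text(strip=True) for a in soup.find_all("a", class_="job-title")]

print(titles)
```

With the real page, the same list comprehension can be applied to the soup object built from the rendered HTML.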