I am trying to scrape Glassdoor job postings. The goal is to loop through 5 pages and scrape all of the data on each page. There are 30 job postings per page. When scraping the postings on a page I get a lot of duplicates; I have partly worked around this with time.sleep(3), and I noticed that increasing the sleep time reduces the duplicates. Here is the code that scrapes each page.
Note: the empty lists were created earlier.
def scrape():
    # Get the html of the current page.
    html = browser.html
    soup = BeautifulSoup(html, "html.parser")
    jobs = soup.find_all("li", class_="jl")
    for job in jobs:
        # Store all info into a list.
        position.append(job.find("div", class_="jobTitle").a.text)
        # ex: Tommy - Singapore
        comp_loc = job.find("div", class_="empLoc").div.text
        comp, loc = comp_loc.split("–")
        # Get rid of trailing white space, then append to the company list.
        company.append(comp.strip())
        location.append(loc.strip())
        browser.click_link_by_href(job.find("a", class_="jobLink")["href"])
        # ------------- Scrape job descriptions within a page -----------
        # Re-read the current html, since clicking a job posting renders new html.
        html = browser.html
        soup = BeautifulSoup(html, "html.parser")
        job_desc.append(soup.find("div", class_="desc").text)
        # Needed because if you go too fast it skips some job descriptions.
        time.sleep(3)
On a single page I get most of the job postings. However, when I loop through the 5 pages I get 210 records in total, but when I print the data after converting it to a set, only 140 records remain.
Here is the code that runs the scrape function:
# ------------- Loop through pages, only up to the 5th. ------------
html = browser.html
soup = BeautifulSoup(html, "html.parser")
# Grab the ul tag.
result = soup.find("div", class_="pagingControls").ul
# From the grabbed ul, get all list items.
pages = result.find_all("li")
print(pages)
# Create storage for the data retrieved by the scrape() function.
position = []
exp_level = []
company = []
employment_type = []
location = []
job_desc = []
# Loop through each list item => each page.
for page in pages:
    # Scrape all job posting data on the current page.
    scrape()
    # Run only if <a> exists, since un-clickables do not have an <a>,
    # skipping "<" and pg1.
    if page.a:
        # Click within the <a> tag, except for the "Next" button.
        # "Next" is for when you are finished with the first 5 pages and
        # want to go to the next set of pages.
        if not page.find("li", class_="Next"):
            # Continue until you hit the list item with the "Next" class, then stop.
            try:
                browser.click_link_by_href(page.a['href'])
            except:
                print("This is the last page")
I think it would work better if I increase the sleep time, but I would like to know whether there are other options. Thanks!
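One alternative to a longer sleep is to skip postings whose link has already been collected; a minimal sketch (the helper name and the idea of keying on the `jobLink` href are my assumptions, not from the original code):

```python
def filter_new_links(hrefs, seen):
    """Return only the hrefs not yet in `seen`, recording them as a side effect."""
    new = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            new.append(href)
    return new

seen = set()
# The duplicate "/job/1" is dropped on the same pass.
print(filter_new_links(["/job/1", "/job/2", "/job/1"], seen))  # ['/job/1', '/job/2']
```

Inside scrape(), the list of hrefs would come from each posting's jobLink anchor, and only the surviving links would be clicked and appended.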
Answer 0 (score: 0)
OK, this should work. It puts all the parameters into one list (out), and then puts that list into another list (thejobs).
It then filters out the duplicates and returns a list containing only the unique elements (thejobs2).
thejobs2 looks like this:
This is not the cleanest way to do it, but it should work:
[
  ["position", "company", "location", "job_desc"],
  ["position", "company", "location", "job_desc"],
  ["position", "company", "location", "job_desc"],
  ["position", "company", "location", "job_desc"]
  ...
  ...
]
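A minimal sketch of the approach described above (the names `out`, `thejobs`, and `thejobs2` come from the answer; the per-posting dict input format is an assumption):

```python
def scrape_unique(postings):
    """Collect each posting's fields as a row, then filter out duplicate rows."""
    thejobs = []
    for p in postings:
        # Put all the parameters into one list (out)...
        out = [p["position"], p["company"], p["location"], p["job_desc"]]
        # ...then put that list into another list (thejobs).
        thejobs.append(out)
    # Filter duplicates while preserving order; rows are converted to tuples
    # because lists are not hashable and cannot go into a set directly.
    seen = set()
    thejobs2 = []
    for row in thejobs:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            thejobs2.append(row)
    return thejobs2
```

This mirrors the question's own len(set(...)) check, but keeps the four fields of each posting together so a duplicate is only dropped when the entire row repeats.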