Scraping data from each page, but getting duplicates

Date: 2019-05-05 23:26:43

Tags: python web-scraping beautifulsoup

I am trying to scrape Glassdoor job postings. The goal is to loop through 5 pages and scrape all the data on each page; every page has 30 job postings. When scraping the postings on a page, it picks up a lot of duplicates. I have partly worked around this with time.sleep(3), and I noticed that as I increase the sleep time, the duplicates decrease. Here is the code I use to scrape each page.

Note: the empty lists were created earlier.

import time
from bs4 import BeautifulSoup

# browser is an already-initialized splinter Browser;
# position, company, location, and job_desc are the previously created lists.

def scrape():
    # Get the HTML of the current page
    html = browser.html
    soup = BeautifulSoup(html, "html.parser")
    jobs = soup.find_all("li", class_="jl")

    for job in jobs:
        # Store the job title
        position.append(job.find("div", class_="jobTitle").a.text)
        # e.g. "Tommy – Singapore"
        comp_loc = job.find("div", class_="empLoc").div.text
        comp, loc = comp_loc.split("–")
        # Strip surrounding whitespace, then append to the company and location lists
        company.append(comp.strip())
        location.append(loc.strip())

        # Open the posting so its description is rendered
        browser.click_link_by_href(job.find("a", class_="jobLink")["href"])

        # ------- Scrape the job description within the page -------
        # Re-parse the HTML, since clicking a posting renders new HTML
        html = browser.html
        soup = BeautifulSoup(html, "html.parser")
        job_desc.append(soup.find("div", class_="desc").text)
        # Going too fast skips some job descriptions
        time.sleep(3)

On a single page I get most of the job postings. But when I loop through the 5 pages I get 210 entries in total, and when I print the data after converting it to a set, only 140 remain.
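
For reference, a quick way to reproduce the counts described above (a sketch, assuming the position list filled in by scrape()):

print(len(position))       # total entries scraped across the 5 pages (210 here)
print(len(set(position)))  # unique entries (140 here)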

Here is the code that runs the scrape function:

# ------------- Loop through the pages, only up to the 5th -------------

html = browser.html
soup = BeautifulSoup(html, "html.parser")
# Grab the ul tag that holds the paging controls
result = soup.find("div", class_="pagingControls").ul
# From the grabbed ul, get all the list items
pages = result.find_all("li")

print(pages)

# Create storage for the data retrieved by the scrape() function
position = []
exp_level = []
company = []
employment_type = []
location = []
job_desc = []

# Loop through each list item => each page
for page in pages:
    # Scrape all job posting data on the current page
    scrape()

    # Only run if an <a> exists, since the unclickable items (the < arrow and
    # the current page) have no <a>
    if page.a:
        # Click every <a> except the "Next" button; "Next" is for when you are
        # finished with the first 5 pages and want the next set of pages
        if not page.find("li", class_="Next"):
            # Continue until you hit the item with the "Next" class, then stop
            try:
                browser.click_link_by_href(page.a['href'])
            except Exception:
                print("This is the last page")

I think it would work better if I increased the sleep time, but I would like to know whether there are other options. Thanks!

1 Answer:

Answer 0: (score: 0)

OK, this should work. It puts all the fields into one list (out), then puts that list into another list (thejobs).

It then filters out the duplicates and returns a list containing only the unique elements (thejobs2).

This is not the cleanest way to do it, but it should work:

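A minimal sketch of the approach the answer describes, reusing the parsing logic from the question (the names out, thejobs, and thejobs2 follow the answer's wording; the exact dedup loop is an assumption):

def scrape():
    # Parse the current page (browser is the splinter Browser from the question)
    soup = BeautifulSoup(browser.html, "html.parser")
    jobs = soup.find_all("li", class_="jl")

    thejobs = []
    for job in jobs:
        title = job.find("div", class_="jobTitle").a.text
        comp, loc = job.find("div", class_="empLoc").div.text.split("–")

        # Open the posting and re-parse to get its description
        browser.click_link_by_href(job.find("a", class_="jobLink")["href"])
        desc = BeautifulSoup(browser.html, "html.parser").find("div", class_="desc").text
        time.sleep(3)

        # Put all the fields for one posting into a list (out),
        # then put that list into another list (thejobs)
        out = [title, comp.strip(), loc.strip(), desc]
        thejobs.append(out)

    # Filter out duplicates and return only the unique elements (thejobs2)
    thejobs2 = []
    for job in thejobs:
        if job not in thejobs2:
            thejobs2.append(job)
    return thejobs2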

thejobs2 would look like this:

[
    ["position", "company", "location", "job_desc"], 
    ["position", "company", "location", "job_desc"], 
    ["position", "company", "location", "job_desc"], 
    ["position", "company", "location", "job_desc"]
    ...
    ...
]
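
Note that the membership check against thejobs2 makes the dedup O(n²) in the number of postings; that is fine for a few hundred rows, but a set of tuples would scale better. It also only drops duplicates after they have been scraped, so the page-timing issue from the question still costs the extra requests.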