Returning scraped results

Time: 2017-11-20 14:24:35

Tags: python python-3.x web-scraping

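I am trying to scrape Indeed.com. I need Python to return the number of results for a search on the job "data scientist" in the city "Milan". I think this can be done either by extracting the result count shown on the page, or by counting the search results themselves (which is what I tried to do in sections 1) and 2) below). This is my first time using Python, and I need this for a business project for which this simple search is the starting point. Could you help me write code that returns the number of results? Thank you very much for your help!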

##import something 
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

##tell python what I am looking for 
URL="""https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20"""
page = requests.get(URL)
soup = BeautifulSoup(page.text,"html.parser")
#print(soup.prettify())

##extract the job title (didn't work)
def extract_job_title_from_result(soup):
    jobs = []
    # each search result is expected in a <div class="row">, with the title on an <a> tag
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs
output = extract_job_title_from_result(soup)
print(output)

### 1) count the results
query, location = "data", "lombardia"
URL_for_count = "https://it.indeed.com/offerte-lavoro?q={}&l={}".format(query, location)
soup_for_count = BeautifulSoup(requests.get(URL_for_count).text, 'html.parser')
results_number = soup_for_count.find("div", attrs={"id": "searchCount"}).text
# the total is taken to be the last token of the counter text; strip thousands separators
number_of_results = int(results_number.split()[-1].replace(',', '').replace('.', ''))


### 2) iterate over the result pages of Indeed to get ALL of the results
## number of results shown per page = 10
results_per_page = 10
results = []
for start in range(0, number_of_results, results_per_page):
    URL_for_results = "https://it.indeed.com/offerte-lavoro?q={}&l={}&start={}".format(query, location, start)
    soup_for_results = BeautifulSoup(requests.get(URL_for_results).text, 'html.parser')
    # accumulate instead of overwriting, so every page's results are kept
    results.extend(soup_for_results.find_all('div', attrs={'data-tn-component': 'organicJob'}))
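
For approach 1), the fragile part is turning the counter text into an integer, since the thousands separator may be "." or "," depending on the site's locale. Below is a minimal sketch of that step only, using the standard library. The helper name `parse_result_count` and both sample strings are made up for illustration; the real text inside the `searchCount` element may look different.

```python
import re

def parse_result_count(text):
    """Return the last integer found in text, ignoring '.' and ',' separators."""
    # A "number" here is a digit followed by any mix of digits, dots and commas.
    groups = re.findall(r"\d[\d.,]*", text)
    if not groups:
        raise ValueError("no number found in {!r}".format(text))
    # Assumption: the total is the last number in the counter text.
    digits = re.sub(r"[.,]", "", groups[-1])
    return int(digits)

# Hypothetical counter strings, not copied from the site.
print(parse_result_count("Jobs 21 to 30 of 1,234"))
print(parse_result_count("Pagina 3 di 1.234 risultati"))
```

Taking the last number mirrors the `split()[-1]` idea in the code above, but it does not break when the total is followed by another word.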

1 answer:

Answer 0 (score: 1)

You can use BeautifulSoup's find_all method:

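from bs4 import BeautifulSoup as soup
import urllib.request  # Python 3; Python 2 exposed this as urllib.urlopen
# pass the raw bytes to BeautifulSoup; str() on bytes would yield a "b'...'" repr in Python 3
data = urllib.request.urlopen('https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20').read()
listing = soup(data, 'lxml')  # the 'lxml' parser requires the lxml package
# one <h2> per job title; [1:-1] trims the first and last character of each text
jobs = [i.text[1:-1] for i in listing.find_all('h2')]
print(jobs)
print("number of jobs is: {}".format(len(jobs)))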

Output (from a Python 2 run, hence the u'' prefixes):

[u'Data Scientist', u'Data Scientist', u'Junior Data Analyst', u'Oracle Data Integrator Junior', u'Junior Data Warehouse', u'Data Scientist/Biostatistician', u'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', u'Data Scientist with Machine Learning', u'DATA SCIENTIST- MACHINE LEARNING EXPERT', u'7224 Internal Audit - Quantitative Analyst']

number of jobs is: 10
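
To see the "one `<h2>` per job title" idea without hitting the network, here is a rough offline sketch. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra packages. The `H2Collector` class, the `extract_h2_titles` helper and the `SAMPLE_HTML` snippet are all invented for illustration; the real page markup is likely more complex.

```python
from html.parser import HTMLParser

class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self._buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Text of nested tags (such as the <a> link) is collected too.
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.titles.append("".join(self._buffer).strip())
            self._in_h2 = False

# Hypothetical markup, loosely modelled on the attributes used in the question.
SAMPLE_HTML = """
<div class="row"><h2>
<a data-tn-element="jobTitle" title="Data Scientist">Data Scientist</a>
</h2></div>
<div class="row"><h2>
<a data-tn-element="jobTitle" title="Junior Data Analyst">Junior Data Analyst</a>
</h2></div>
"""

def extract_h2_titles(html):
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.titles

titles = extract_h2_titles(SAMPLE_HTML)
print(titles)
print("number of jobs is: {}".format(len(titles)))
```

Here `strip()` plays the role of the `[1:-1]` slice: it removes the whitespace around each title, however many characters that turns out to be.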

Edit: to get the data for the first six pages:

import urllib.request  # Python 3 location of urlopen
final_data = [[b.text[1:-1] for b in soup(urllib.request.urlopen("https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start={}".format(10*i)).read(), "lxml").find_all('h2')] for i in range(6)]
lengths = list(map(len, final_data))
print(final_data)
print(lengths)
print(sum(lengths))

Output:

[[u'Data Scientist \u2013 Social Media Intelligence', u'DATA ANALYST', u'Data Analyst', u'Data Analyst', u'Data Analyst', u'Data Analyst', u'Data Analyst', u'Data Entry Specialist', u'Impiegato Data Entry', u'Data Scientist'], [u'Junior Data Scientist', u'DATA ANALYST JR \u2013 Milano', u'STAGE JUNIOR DATA ANALYST / DATA SCIENTIST BIG DATA', u'Machine Learning Scientist', u'Data Analyst', u'Data Analyst (Econometric modeling) Sede di Milano', u'Neolaureati in statistica, matematica, ingegneria-Data Scien...', u'Data Scientist', u'Data Scientist', u'Data Scientist'], [u'Data Scientist', u'Data Scientist', u'Junior Data Analyst', u'Oracle Data Integrator Junior', u'Junior Data Warehouse', u'Data Scientist/Biostatistician', u'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', u'Data Scientist with Machine Learning', u'DATA SCIENTIST- MACHINE LEARNING EXPERT', u'7224 Internal Audit - Quantitative Analyst'], [u'Collaboratori Data Entry', u'Data Scientist', u'DATA ENTRY', u'Consumer Data Scientist', u'DATA ANALYST', u'JUNIOR - RISK ADVISORY - TECHNOLOGY & DATA RISK - PRODUCTS &...', u'Data Manager Ematologia', u'Data Scientist', u'Esperto Tecnologie Big Data \u2013 Text Analysis \u2013 Data Mining', u'Data Entry'], [u'People Data Analyst', u'Data Integration Analyst \u2013 TIBCO', u'ORACLE BI - Big Data Analytics', u'Data Strategist', u'Data Governance Specialist', u'Big Data Specialist', u'Oracle Data Integrator Specialist', u'Innovation Analyst', u'Data Scientist', u'Big Data Engineer'], [u'JUNIOR BIG DATA ENGINEER', u'Junior Payment Analyst', u'Esperti BIG DATa e DWH', u'Data Warehouse Manager', u'Data Analyst', u'Big Data Engineer', u'data entry part time', u'Big Data & Datawarehouse Architect Location: Milano', u'Biomedical Signal/Image Processing Data Analyst', u'IT Big Data Engineer']]
[10, 10, 10, 10, 10, 10]
60
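
The edit above hard-codes six pages. If the total number of results is already known, the `start` offsets can be derived from it instead. The following is only a sketch of that arithmetic: the `page_urls` helper is hypothetical, the page size of 10 comes from the question, and the total of 25 is a made-up value. Nothing is downloaded.

```python
from urllib.parse import urlencode

BASE_URL = "https://it.indeed.com/offerte-lavoro"  # search endpoint used in the post

def page_urls(query, location, total_results, per_page=10):
    """Return one search URL per results page, covering total_results listings."""
    urls = []
    # start = 0, 10, 20, ... up to (but not including) total_results
    for start in range(0, total_results, per_page):
        params = urlencode({"q": query, "l": location, "start": start})
        urls.append("{}?{}".format(BASE_URL, params))
    return urls

# Hypothetical total of 25 results gives three pages (start = 0, 10, 20).
urls = page_urls("data", "lombardia", total_results=25)
for url in urls:
    print(url)
```

Each URL could then be fetched and parsed the same way as the single page in the answer.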