I am trying to scrape a job-search portal. I want to retrieve the job title, location, and company for each posting, and then compile all of that information into a single Excel file.

My approach so far: the site has 23 pages with 50 postings each. From each page I collect all titles, locations, and companies into three 1x50 lists, then combine those three lists into a 50x3 pandas DataFrame. I then move to the next page, repeat the process, and append the new DataFrame to the existing one, giving a 100x3 DataFrame. After doing this for all 23 pages I should end up with a 1150x3 DataFrame, which I finally write out to Excel.

Here is my current code:
import requests
import bs4 as bs
import pandas as pd

headers = ['Job_title', 'Company_name', 'Location']
search_page = 'a website'
stop = 5

df_list = []  # created once, before the loop, so every page's DataFrame is kept
for i in range(stop + 1):
    indexer = search_page + str(i)
    print(indexer)
    results = requests.get(indexer)
    soup = bs.BeautifulSoup(results.content, 'html.parser')

    # job location
    loc_list = []
    for loc_tag in soup.find_all('li', class_='offer-location'):
        loc_list.append(loc_tag.text)

    # company name
    comp_names = []
    for li_tag in soup.find_all('li', class_='offer-company'):
        comp_names.append(li_tag.a.text)

    # job title
    title_names = []
    for h2_tag in soup.find_all('h2'):
        title_names.append(h2_tag.a.text)

    # compile this page's data into a DataFrame and collect it
    df = pd.DataFrame(list(zip(title_names, comp_names, loc_list)), columns=headers)
    df_list.append(df)

# writing to Excel: concatenate all per-page frames once, after the loop
df_finished = pd.concat(df_list, ignore_index=True)
print("Concatenated")
df_finished.to_excel("output1.xlsx", index=False)
print("Printed to Excel")
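The accumulate-then-concatenate pattern above can be checked in isolation, without any network calls. Below is a minimal sketch using made-up page data (the `pages` list and its contents are hypothetical stand-ins for what BeautifulSoup would extract); the key point is that the collecting list is initialised once, outside the loop, and `pd.concat` is called a single time at the end:

```python
import pandas as pd

# Hypothetical per-page results; in the real scraper these
# lists would come from soup.find_all(...) on each page.
pages = [
    (["Dev"], ["Acme"], ["Oslo"]),
    (["QA"], ["Beta"], ["Bergen"]),
]

headers = ["Job_title", "Company_name", "Location"]

df_list = []  # initialised once, so nothing is reset between pages
for titles, companies, locations in pages:
    df = pd.DataFrame(list(zip(titles, companies, locations)), columns=headers)
    df_list.append(df)

# one concat stitches all per-page frames into a single DataFrame
df_finished = pd.concat(df_list, ignore_index=True)
print(df_finished.shape)  # (2, 3): one row per posting, three columns
```

If the list were re-created inside the loop, only the final page would survive to the `concat` call, which is exactly the symptom of keeping `df_list = []` inside the page loop.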