I am trying to scrape a website for job posting data, and the output looks like this:
[{'job_title': 'Junior Data Scientist', 'company': '\n\n BBC', 'summary': '\nWe are looking for a Junior Data Scientist to come and work with our Marketing and Audiences team in London. The Data Science team is responsible for designing...', 'link': 'www.jobsite.com',
'summary_text': 'Job Introduction\nImagine if Netflix, The Huffington Post, ESPN and Spotify were all rolled into one....' etc.
I would like to create a dataframe, or a CSV, that looks like this:
Right now, this is the loop I am using:
for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')

    results = []
    for jobs in soup.findAll(class_='result'):
        result = {
            'job_title': '',
            'company': '',
            'summary': '',
            'link': '',
            'summary_text': ''
        }
After the loop I just print the results.
What would be a good way of getting the output into a dataframe? Thanks!
Answer 0 (score: 3)
Check out the pandas DataFrame API. There are several ways you can initialise a dataframe.
You just need to append your lists or dictionaries to a single global variable, and you should be good to go.
import requests
import pandas
from bs4 import BeautifulSoup

results = []
for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')

    for jobs in soup.findAll(class_='result'):
        result = {
            'job_title': '',  # assuming this has a value like you shared in the example in your question
            'company': '',
            'summary': '',
            'link': '',
            'summary_text': ''
        }
        results.append(result)

# results is now a list of dictionaries
df = pandas.DataFrame(results)
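If you also want the CSV the question asks about, a one-line follow-up sketch (the filename jobs.csv is only a placeholder, not part of the original answer):
# write the assembled DataFrame to disk; index=False drops pandas' row-index column
df.to_csv('jobs.csv', index=False)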
Another suggestion: don't think about dumping it into a dataframe in the same program. First dump all the HTML files into a folder, and then parse them in a second pass. That way, if you later need more information from the pages that you hadn't considered before, or if the program terminates because of some parsing error or a timeout, the work isn't lost. Keep the parsing separate from the crawling logic. A rough sketch of that two-stage idea is shown below.
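A minimal sketch of that two-stage approach, assuming a local pages/ folder, a small fixed page range, and the same .result selector as in the question; it only illustrates the suggestion above and is not code from the original answer:
# stage 1: crawl - save every results page to a local folder so no work is lost on errors
import os
import requests
from bs4 import BeautifulSoup

os.makedirs('pages', exist_ok=True)
for page in range(1, 4):  # assumed small page range, purely for illustration
    url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&page={}'.format(page)
    with open(os.path.join('pages', 'page_{}.html'.format(page)), 'w', encoding='utf-8') as f:
        f.write(requests.get(url).text)

# stage 2: parse - read the saved files; this step can be re-run without hitting the site again
results = []
for name in sorted(os.listdir('pages')):
    with open(os.path.join('pages', name), encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    for job in soup.findAll(class_='result'):
        results.append({'job_title': '', 'company': ''})  # populate with the same selectors as before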
Answer 1 (score: 1)
I think you need to define the number of pages and add that into your url (make sure you have a placeholder for that value, which I don't think your code, nor the other answer, has). I have done this by extending your url to include a page parameter in the querystring, which incorporates a placeholder.
Is your result class selector correct? You could certainly also use for job in soup.select('.job'): instead. You then need to define the appropriate selectors to populate the values. I think it is easier to grab all the job links for each page, then visit each page and extract the values from a json-like string within the page. Add a Session to re-use the connection.
Explicit waits are required to prevent being blocked.
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 3

with requests.Session() as s:
    for page in range(1, pages + 1):
        try:
            url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
            source = s.get(url, headers=headers).text
            soup = bs(source, 'lxml')
            links.append([link['href'] for link in soup.select('.job-title a')])
        except Exception as e:
            print(e, url)
        finally:
            time.sleep(2)

    final_list = [item for sublist in links for item in sublist]
    for link in final_list:
        source = s.get(link, headers=headers).text
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text  # json-like string containing all info
        item = json.loads(data)
        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }
        results.append(result)
        time.sleep(1)

df = pd.DataFrame(results, columns=['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)
Sample of results:
I can't imagine you want all the pages, but you could use something similar to:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 0

def get_links(url, page):
    page_links = []  # ensure the function still returns a list if the request or parsing fails
    try:
        source = s.get(url, headers=headers).text
        soup = bs(source, 'lxml')
        page_links = [link['href'] for link in soup.select('.job-title a')]
        if page == 1:
            global pages
            pages = int(soup.select_one('.page-title span').text.replace(',', ''))
    except Exception as e:
        print(e, url)
    finally:
        time.sleep(1)
    return page_links

with requests.Session() as s:
    links.append(get_links('https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page=1', 1))
    for page in range(2, pages + 1):
        url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
        links.append(get_links(url, page))

    final_list = [item for sublist in links for item in sublist]
    for link in final_list:
        source = s.get(link, headers=headers).text
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text  # json-like string containing all info
        item = json.loads(data)
        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }
        results.append(result)
        time.sleep(1)

df = pd.DataFrame(results, columns=['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)