How do I create a dataframe from data scraped from a website?

Asked: 2019-04-04 05:23:02

Tags: python pandas for-loop web-scraping export-to-csv

I'm scraping a website for job posting data, and the output looks like this:

  

[{'job_title': 'Junior Data Scientist', 'company': '\n\n BBC', 'summary': "\nWe're looking for a Junior Data Scientist to come and work with our Marketing and Audiences team in London. The Data Science team is responsible for designing...", 'link': 'www.jobsite.com',
 'summary_text': 'Job introduction\nImagine if Netflix, The Huffington Post, ESPN and Spotify were all rolled into one. ...etc

I want to create a dataframe, or a CSV, that looks like this:

[Expected output (image)]

Right now, this is the loop I'm using:

for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')

results = []
for jobs in soup.findAll(class_='result'):
    result = {
                'job_title': '',
                'company': '',
                'summary': '',
                'link': '',
                'summary_text': ''
            }

After the loop, I just print the results.

What would be a good way to get the output into a dataframe? Thanks!

2 Answers:

Answer 0 (score: 3)

Take a look at the pandas DataFrame API. You can initialize a dataframe in several ways, for example from either of the following (a short sketch follows the list):

  • a list of dictionaries
  • a list of lists
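
A minimal sketch of both styles (the column names and values are only examples):

import pandas

# from a list of dictionaries - the keys become the column names
df = pandas.DataFrame([{'job_title': 'Junior Data Scientist', 'company': 'BBC'}])

# from a list of lists - the column names are passed explicitly
df = pandas.DataFrame([['Junior Data Scientist', 'BBC']], columns=['job_title', 'company'])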

You just need to append your lists or dictionaries to a global variable, and you should be good to go.

import requests
from bs4 import BeautifulSoup
import pandas

results = []
for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')

    for jobs in soup.findAll(class_='result'):
        result = {
            'job_title': '',  # assuming this gets a value like in the example in your question
            'company': '',
            'summary': '',
            'link': '',
            'summary_text': ''
        }
        results.append(result)
    # results is now a list of dictionaries

df = pandas.DataFrame(results)

Another suggestion: don't try to dump the data into a dataframe in the same program. First dump all the HTML files into a folder, then parse them in a separate step. That way, if you later need more information from the pages that you hadn't thought of, or if the program dies because of some parsing error or a timeout, your work isn't lost. Keep the parsing separate from the crawling logic.
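
A rough sketch of that separation, assuming a local html_dump folder (the folder name, page range, and fields are only placeholders for illustration):

import os
import requests
from bs4 import BeautifulSoup

# step 1: crawl - save the raw HTML to disk and nothing else
os.makedirs('html_dump', exist_ok=True)
for page in range(1, 4):
    url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&page={}'.format(page)
    with open('html_dump/page_{}.html'.format(page), 'w', encoding='utf-8') as f:
        f.write(requests.get(url).text)

# step 2: parse - can be re-run as often as needed without re-downloading anything
results = []
for name in os.listdir('html_dump'):
    with open('html_dump/' + name, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    for job in soup.findAll(class_='result'):
        results.append({'job_title': '', 'company': ''})  # fill in whatever fields you need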

Answer 1 (score: 1)

I think you need to define the number of pages and add it to your url (making sure you have a placeholder for that value, which I don't think either your code or the other answer has). I did this by extending your url so that the querystring includes a page parameter with a placeholder.

Is your result class selector the right one? You could certainly also use for job in soup.select('.job'):, but you would then need to define the appropriate selectors to populate each value. I think it is easier to grab all the job links from each page, then visit each link and extract the values from the JSON-like string inside the page. A Session is added to re-use the connection.

Explicit waits are needed to avoid being blocked.

import requests 
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 3

with requests.Session() as s:
    for page in range(1, pages + 1):
        try:
            url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
            source = s.get(url, headers = headers).text
            soup = bs(source, 'lxml') 
            links.append([link['href'] for link in soup.select('.job-title a')])
        except Exception as e:
            print(e, url )
        finally:
            time.sleep(2)

    final_list = [item for sublist in links for item in sublist]  

    for link in final_list:  
        source = s.get(link, headers = headers).text        
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text #json like string containing all info
        item = json.loads(data)

        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }

        results.append(result)
        time.sleep(1)
df = pd.DataFrame(results, columns = ['Title', 'Company', 'Url', 'Summary']) 
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )

A sample of the results:

[screenshot of the resulting DataFrame]


I can't imagine you want all the pages, but if you do, you could use something like the following:

import requests 
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 0

def get_links(url, page):
    page_links = []  # make sure a list is returned even if the request fails
    try:
        source = s.get(url, headers = headers).text
        soup = bs(source, 'lxml') 
        page_links = [link['href'] for link in soup.select('.job-title a')]
        if page == 1:
            global pages
            pages = int(soup.select_one('.page-title span').text.replace(',',''))
    except Exception as e:
        print(e, url )
    finally:
        time.sleep(1)
    return page_links

with requests.Session() as s:

    links.append(get_links('https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page=1',1))

    for page in range(2, pages + 1):
        url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
        links.append(get_links(url, page))

    final_list = [item for sublist in links for item in sublist]  

    for link in final_list:  
        source = s.get(link, headers = headers).text        
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text #json like string containing all info
        item = json.loads(data)

        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }

        results.append(result)
        time.sleep(1)
df = pd.DataFrame(results, columns = ['Title', 'Company', 'Url', 'Summary']) 
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )