Outputting scraped data to a DataFrame

Date: 2020-08-31 13:39:28

Tags: python pandas dataframe web-scraping

Hello, so far I have scraped this information from a job-listing website. Everything seems to run fine, but I am struggling to get the information into a DataFrame with the headers and all. Any help is appreciated. My full code is:

import requests
from bs4 import BeautifulSoup
import pandas as pd 

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='ResultsContainer')

# guard against tags whose string is None before calling .lower()
python_jobs = results.find_all('h2', string=lambda text: text and 'test' in text.lower())
for p_job in python_jobs:
    link = p_job.find('a')['href']
    print(p_job.text.strip())
    print(f"Apply Here: {link}")

job_elems = results.find_all('section', class_= 'card-content')

for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()

Not sure how to go about this.

2 answers:

Answer 0 (score: 0)

You can store the job details (i.e. title, company, and location) in a dictionary, then build a DataFrame from that dictionary.

import requests
from bs4 import BeautifulSoup
import pandas as pd 

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='ResultsContainer')

# guard against tags whose string is None before calling .lower()
python_jobs = results.find_all('h2', string=lambda text: text and 'test' in text.lower())
for p_job in python_jobs:
    link = p_job.find('a')['href']
    print(p_job.text.strip())
    print(f"Apply Here: {link}")

job_elems = results.find_all('section', class_= 'card-content')
i = 1
my_job_list = {}
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    op = f'opening {i}'
    my_job_list[op] = {'position': title_elem.text.strip(),
                       'company': company_elem.text.strip(),
                       'location': location_elem.text.strip()}
    i = i + 1
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())

# transpose so each opening is a row and position/company/location are the column headers
df = pd.DataFrame(my_job_list).T

print(df)
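A variant of the same idea, in case it is easier to follow: collect each job as a plain dict in a list and hand the whole list to `pd.DataFrame`, which turns the dict keys into column headers directly, with no transpose needed. The two rows below are placeholder values standing in for real scrape output:

```python
import pandas as pd

# Hypothetical rows; in the real scraper each loop iteration
# would append one dict of stripped text values.
rows = [
    {'position': 'Software Developer', 'company': 'Acme', 'location': 'Sydney'},
    {'position': 'Backend Engineer', 'company': 'Globex', 'location': 'Melbourne'},
]

df = pd.DataFrame(rows)  # columns: position, company, location
print(df)
```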

Answer 1 (score: 0)

Build each row with concat() across the columns, then concatenate the rows into a single DataFrame inside the loop.

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='ResultsContainer')

# guard against tags whose string is None before calling .lower()
python_jobs = results.find_all('h2', string=lambda text: text and 'test' in text.lower())
for p_job in python_jobs:
    link = p_job.find('a')['href']
    print(p_job.text.strip())
    print(f"Apply Here: {link}")

job_elems = results.find_all('section', class_= 'card-content')

df = pd.DataFrame()

for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    # name each Series so the resulting columns get proper headers
    df1 = pd.concat([pd.Series(title_elem.text.strip(), name='title'),
                     pd.Series(company_elem.text.strip(), name='company'),
                     pd.Series(location_elem.text.strip(), name='location')], axis=1)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df = pd.concat([df, df1], ignore_index=True)
print(df)
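One caveat about growing a DataFrame inside a loop: every concatenation copies all rows accumulated so far, which gets slow for many pages of results. A minimal sketch of the usual alternative, collecting plain dicts and building the frame once at the end (the loop values here are placeholders standing in for the scraped text):

```python
import pandas as pd

rows = []
# In the real scraper each iteration would use the stripped text
# of title_elem, company_elem, and location_elem instead.
for title, company, location in [('Dev', 'Acme', 'Sydney'),
                                 ('Tester', 'Globex', 'Perth')]:
    rows.append({'title': title, 'company': company, 'location': location})

df = pd.DataFrame(rows)  # one row per job, headers from the dict keys
print(df.shape)  # prints (2, 3)
```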