Trying to parse a website with Python: I want to save to CSV and run over multiple pages

Asked: 2017-01-23 19:24:51

Tags: python web-scraping beautifulsoup

I'm new to Python and web scraping. I tried to do this myself, but I'm stuck.

I want to scrape efinancialcareers.com for job offers. I wrote code that gets the elements from the HTML, and I can print them to the console, but I need help saving the data to CSV and running the script over all the result pages. Here is the code:

import requests
from bs4 import BeautifulSoup
import csv
import datetime
print datetime.datetime.now()
url = "http://www.efinancialcareers.com/search?page=1&sortBy=POSTED_DESC&searchMode=DEFAULT_SEARCH&jobSearchId=RUJFMEZDNjA2RTJEREJEMDcyMzlBQ0YyMEFDQjc1MjUuMTQ4NTE5MDY3NTI0Ni4tMTQ1Mjc4ODU3NQ%3D%3D&updateEmitter=SORT_BY&filterGroupForm.includeRefreshed=true&filterGroupForm.datePosted=OTHER"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')

f = open ('EFINCAR.txt', 'w')
f.write('Job name;')
f.write('Salary;')
f.write('Location;')
f.write('Position;')
f.write('Company')
f.write('Date')
f.write('\n')


# Job name
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for details in container.find_all('li',{'class':'jobPreview well'}):
        for h3 in details.find_all('h3'):
            job=h3.find('a')
        print(job.text)

# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)

# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)

# Position
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            position=details.find('li',{'class':'position'})
            print(position.text)

# Company
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            company=details.find('li',{'class':'company'})
            print(company.text)

# Date
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            datetext=details.find('li',{'class':'updated'})
            print(datetext.text)

#       Attributes assignment section

#       Job Name
job_name = job.get_text()
f.write(job_name.encode('utf-8'))
f.write(';')

#       Salary

salary_name = salary.get_text()
f.write(salary_name.encode('utf-8'))
f.write(';')

#       location
location_name = location.get_text()
location_name = location_name.strip()
f.write(location_name.encode('utf-8'))
f.write(';')

#       position
position_name = position.get_text()
position_name = position_name.strip()
f.write(position_name.encode('utf-8'))
f.write(';')

#       company
company_name = company.get_text()
company_name = company_name.strip()
f.write(company_name.encode('utf-8'))
f.write(';')

#       Datetext
datetext_name = datetext.get_text()
datetext_name = datetext_name.strip()
f.write(datetext_name.encode('utf-8'))
f.write(';')
f.write('\n')

f.close()
print('Finished!')

1 Answer:

Answer 0 (score: 1)

Welcome to StackOverflow!

Let's take a look at your code.

You have six three-level nested for loops (18 loops in total). As you can see, they are almost identical, and each one contains:

for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):

So instead of writing the same code six times, you can write it once and do everything inside it. For example:

# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)

# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)

can be written as:

# Salary & Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            print(location.text)

Writing DRY (Don't Repeat Yourself) code is considered good practice.
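As a sketch, here is the same idea extended to all six fields in one pass (the class names come from your own code; I have not re-checked them against the live page):

# One pass over the job previews, collecting all six fields at once.
# Class names ('salary', 'location', 'position', 'company', 'updated')
# are taken from the question's code and assumed to still match the page.
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        job = JobsPreview.find('h3').find('a')
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary = details.find('li',{'class':'salary'})
            location = details.find('li',{'class':'location'})
            position = details.find('li',{'class':'position'})
            company = details.find('li',{'class':'company'})
            datetext = details.find('li',{'class':'updated'})
            print(job.text)
            print(salary.text)
            print(location.text)
            print(position.text)
            print(company.text)
            print(datetext.text)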

The reason you see the parsed HTML data in the console is that your print(XXXXX) calls are INSIDE your for loops: as each element is parsed, it gets printed to the console.

You only see a single row of data in the text file (EFINCAR.txt) because your f.write(xxxx) calls are OUTSIDE your for loops, so they run just once, with the values from the last iteration. You should move them next to your print(xxxx) calls.

For example:

# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            salary_name = salary.get_text()
            f.write(salary_name.encode('utf-8'))
            f.write(';')

When you do this, you will notice that there is a problem with the parsed HTML.

Hint: pay attention to tabs, newlines, and spaces.

To save the data to csv and have it come out correctly, you should strip these out while parsing. You can skip this, of course, but the result may look ugly.
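For instance, here is a minimal sketch using the standard csv module instead of the manual f.write calls, with .strip() dropping the stray tabs, newlines, and spaces (class names come from your code; written for Python 2 to match your print statement, and not tested against the live page):

import csv

# Open in binary mode for Python 2's csv module; in Python 3 you would
# use open('EFINCAR.csv', 'w', newline='') and drop the .encode() call.
with open('EFINCAR.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['Job name', 'Salary', 'Location', 'Position', 'Company', 'Date'])
    for container in soup.find_all('div',{'class':'jobListContainer'}):
        for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
            job = JobsPreview.find('h3').find('a')
            details = JobsPreview.find('ul',{'class':'details'})
            row = [job.text] + [details.find('li',{'class':c}).text
                                for c in ('salary','location','position','company','updated')]
            # strip whitespace and encode each cell, one csv row per job
            writer.writerow([cell.strip().encode('utf-8') for cell in row])

With csv.writer you also avoid the missing-separator problem in your hand-written header row, where 'Company' and 'Date' run together.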

Finally, if you want to run the script for several pages, or for all of them, you should check how the page number is reflected in your request URL. For example, for page 1 you have:

http://www.efinancialcareers.com/search?page=1XXXXXXXXXXXXXXX
and for page 2:

http://www.efinancialcareers.com/search?page=2XXXXXXXXXXXXXXX

This means you should run your code with url = http://www.efinancialcareers.com/search?page={NUMBER_OF_PAGE}XXXXXXXXXXXXXXX, where NUMBER_OF_PAGE runs from 1 to LAST_PAGE. So, as described above, instead of hard-coding the URL you can simply create a loop and generate the URLs.
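A minimal sketch of that loop (XXXXXXXXXXXXXXX again stands in for the rest of your query string, and LAST_PAGE is a placeholder you would read off the site's pagination links):

import requests
from bs4 import BeautifulSoup

# XXXXXXXXXXXXXXX abbreviates the remainder of the original query string;
# LAST_PAGE is a placeholder, not a real value from the site.
BASE_URL = 'http://www.efinancialcareers.com/search?page={}XXXXXXXXXXXXXXX'
LAST_PAGE = 10

for page_number in range(1, LAST_PAGE + 1):
    response = requests.get(BASE_URL.format(page_number))
    soup = BeautifulSoup(response.content, 'lxml')
    # ...parse the previews and write the csv rows for this page, as above...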