I'm new to Python and web scraping. I tried to do this on my own, but I'm stuck.
I want to scrape efinancialcareers.com for job offers. I wrote code that gets the HTML elements and I can print them to the console, but I need help saving the data to a CSV file and running the script across all the result pages. Here is the code:
import requests
from bs4 import BeautifulSoup
import csv
import datetime
print(datetime.datetime.now())
url = "http://www.efinancialcareers.com/search?page=1&sortBy=POSTED_DESC&searchMode=DEFAULT_SEARCH&jobSearchId=RUJFMEZDNjA2RTJEREJEMDcyMzlBQ0YyMEFDQjc1MjUuMTQ4NTE5MDY3NTI0Ni4tMTQ1Mjc4ODU3NQ%3D%3D&updateEmitter=SORT_BY&filterGroupForm.includeRefreshed=true&filterGroupForm.datePosted=OTHER"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')
f = open('EFINCAR.txt', 'w')
f.write('Job name;')
f.write('Salary;')
f.write('Location;')
f.write('Position;')
f.write('Company;')
f.write('Date')
f.write('\n')
# Job name
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for details in container.find_all('li',{'class':'jobPreview well'}):
        for h3 in details.find_all('h3'):
            job=h3.find('a')
            print(job.text)
# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)
# Position
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            position=details.find('li',{'class':'position'})
            print(position.text)
# Company
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            company=details.find('li',{'class':'company'})
            print(company.text)
# Date
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            datetext=details.find('li',{'class':'updated'})
            print(datetext.text)
# Attributes assignment section
# Job Name
job_name = job.get_text()
f.write(job_name.encode('utf-8'))
f.write(';')
# Salary
salary_name = salary.get_text()
f.write(salary_name.encode('utf-8'))
f.write(';')
# location
location_name = location.get_text()
location_name = location_name.strip()
f.write(location_name.encode('utf-8'))
f.write(';')
# position
position_name = position.get_text()
position_name = position_name.strip()
f.write(position_name.encode('utf-8'))
f.write(';')
# company
company_name = company.get_text()
company_name = company_name.strip()
f.write(company_name.encode('utf-8'))
f.write(';')
# Datetext
datetext_name = datetext.get_text()
datetext_name = datetext_name.strip()
f.write(datetext_name.encode('utf-8'))
f.write(';')
f.write('\n')
f.close()
print('Finished!')
Answer 0 (score: 1)
Welcome to StackOverflow!
Let's take a look at your code.
You have six three-level nested loops (18 for loops in total). As you can see, they are almost identical, and each contains:
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
So instead of writing the same code six times, you can write it once and do everything inside it. For example:
# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)
can be written as:
# Salary & Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            print(location.text)
Writing DRY (Don't Repeat Yourself) code is considered good practice.
The reason you see the parsed HTML data in the console is that you have print(XXXXX) calls inside your for loops. As each element is parsed, it gets printed to the console.
You don't see the data in your text file (EFINCAR.txt) because your f.write(xxxx) calls are OUTSIDE your for loops. You should move them next to your print(xxxx) calls.
For example:
# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            salary_name = salary.get_text()
            f.write(salary_name.encode('utf-8'))
            f.write(';')
When you do this, you will notice that there are problems with the parsed HTML.
Hint: watch out for tabs, newlines, and spaces.
To save the data to csv and get it right, you should strip these out while parsing. You can skip this, of course, but the result may look ugly.
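As a minimal sketch of that stripping (the li markup below is made up for illustration; only the class name mirrors the ones used above):

from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one parsed list item
html = '<li class="salary">\n    Competitive\t </li>'
li = BeautifulSoup(html, 'lxml').find('li', {'class': 'salary'})

print(repr(li.get_text()))               # raw text, full of whitespace
print(repr(li.get_text().strip()))       # leading/trailing whitespace removed
print(' '.join(li.get_text().split()))   # internal runs of whitespace collapsed too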
Finally, if you want to run the script for several pages, or for all of them, you should check how the page number is reflected in your request URL. For example, for page 1 you have:
http://www.efinancialcareers.com/search?page=1XXXXXXXXXXXXXXX
and for page 2:
http://www.efinancialcareers.com/search?page=2XXXXXXXXXXXXXXX
This means you should request URL = http://www.efinancialcareers.com/search?page={NUMBER_OF_PAGE}XXXXXXXXXXXXXXX
with NUMBER_OF_PAGE running from 1 to LAST_PAGE. So, instead of hardcoding the URL, you can simply create a loop and generate the URLs, as shown in the sketch below.
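Putting the pieces together, here is a Python 3 sketch that loops over the pages and writes rows with the csv module you already import. BASE_URL is shortened here and LAST_PAGE is a placeholder; you would plug in your full query string and the real page count, and the class names are assumed to still match the site's markup:

import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://www.efinancialcareers.com/search?page={}&sortBy=POSTED_DESC'  # shortened; use your full query string
LAST_PAGE = 5  # placeholder - read the real count from the site's pagination controls

with open('EFINCAR.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['Job name', 'Salary', 'Location', 'Position', 'Company', 'Date'])
    for page in range(1, LAST_PAGE + 1):
        soup = BeautifulSoup(requests.get(BASE_URL.format(page)).content, 'lxml')
        for container in soup.find_all('div', {'class': 'jobListContainer'}):
            for preview in container.find_all('li', {'class': 'jobPreview well'}):
                h3 = preview.find('h3')
                job = h3.find('a') if h3 else None
                details = preview.find('ul', {'class': 'details'})
                row = [job.get_text().strip() if job else '']
                # salary/location/position/company/updated mirror the classes used above
                for cls in ('salary', 'location', 'position', 'company', 'updated'):
                    item = details.find('li', {'class': cls}) if details else None
                    row.append(' '.join(item.get_text().split()) if item else '')
                writer.writerow(row)  # one job per CSV row, fields separated by ';'

Letting csv.writer handle the delimiters also spares you the manual f.write(';') bookkeeping from the original script.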