I'm new to Python and web scraping. I tried to do this on my own, but I'm stuck.
I want to scrape efinancialcareers.com for job offers. I wrote code that gets the HTML elements and I can print them to the console, but I need help saving the data to a CSV file and running the script across all the result pages. Here is the code:
import requests
from bs4 import BeautifulSoup
import csv
import datetime
print(datetime.datetime.now())
url = "http://www.efinancialcareers.com/search?page=1&sortBy=POSTED_DESC&searchMode=DEFAULT_SEARCH&jobSearchId=RUJFMEZDNjA2RTJEREJEMDcyMzlBQ0YyMEFDQjc1MjUuMTQ4NTE5MDY3NTI0Ni4tMTQ1Mjc4ODU3NQ%3D%3D&updateEmitter=SORT_BY&filterGroupForm.includeRefreshed=true&filterGroupForm.datePosted=OTHER"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')
f = open('EFINCAR.txt', 'w')
f.write('Job name;')
f.write('Salary;')
f.write('Location;')
f.write('Position;')
f.write('Company;')
f.write('Date')
f.write('\n')
# Job name
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for details in container.find_all('li',{'class':'jobPreview well'}):
        for h3 in details.find_all('h3'):
            job=h3.find('a')
            print(job.text)
# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)
# Position
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            position=details.find('li',{'class':'position'})
            print(position.text)
# Company
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            company=details.find('li',{'class':'company'})
            print(company.text)
# Date
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            datetext=details.find('li',{'class':'updated'})
            print(datetext.text)
# Attributes assignment section
# Job Name
job_name = job.get_text()
f.write(job_name.encode('utf-8'))
f.write(';')
# Salary
salary_name = salary.get_text()
f.write(salary_name.encode('utf-8'))
f.write(';')
# location
location_name = location.get_text()
location_name = location_name.strip()
f.write(location_name.encode('utf-8'))
f.write(';')
# position
position_name = position.get_text()
position_name = position_name.strip()
f.write(position_name.encode('utf-8'))
f.write(';')
# company
company_name = company.get_text()
company_name = company_name.strip()
f.write(company_name.encode('utf-8'))
f.write(';')
# Datetext
datetext_name = datetext.get_text()
datetext_name = datetext_name.strip()
f.write(datetext_name.encode('utf-8'))
f.write(';')
f.write('\n')
f.close()
print('Finished!')
Answer 0 (score: 1)
Welcome to StackOverflow!
Let's take a look at your code.
You have six three-level nested loops (18 for loops in total). As you can see, they are almost identical, and each contains:
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
So instead of writing the same code six times, you can write it once and do everything inside it. For example:
# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
# Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            print(location.text)
can be written as:
# Salary & Location
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            location=details.find('li',{'class':'location'})
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            print(location.text)
Writing DRY (Don't Repeat Yourself) code is considered good practice.
The reason you see the parsed HTML data in the console is that you have print(XXXXX) calls inside your for loops. As each element is parsed, it gets printed to the console.
You don't see the data in your text file (EFINCAR.txt) because your f.write(xxxx) calls are OUTSIDE your for loops. You should move them next to your print(xxxx) calls.
For example:
# Salary
for container in soup.find_all('div',{'class':'jobListContainer'}):
    for JobsPreview in container.find_all('li',{'class':'jobPreview well'}):
        for details in JobsPreview.find_all('ul',{'class':'details'}):
            salary=details.find('li',{'class':'salary'})
            print(salary.text)
            salary_name = salary.get_text()
            f.write(salary_name.encode('utf-8'))
            f.write(';')
When you do this, you will notice that there are problems with the parsed HTML.
Hint: watch out for tabs, newlines, and spaces.
To save the data to csv and get it right, you should strip these out while parsing. You can skip this, of course, but the result may look ugly.
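As a minimal sketch of that stripping (the li markup below is made up for illustration; only the class name mirrors the ones used above):

from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one parsed list item
html = '<li class="salary">\n    Competitive\t </li>'
li = BeautifulSoup(html, 'lxml').find('li', {'class': 'salary'})

print(repr(li.get_text()))               # raw text, full of whitespace
print(repr(li.get_text().strip()))       # leading/trailing whitespace removed
print(' '.join(li.get_text().split()))   # internal runs of whitespace collapsed too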
Finally, if you want to run the script for several pages, or for all of them, you should check how the page number is reflected in your request URL. For example, for page 1 you have:
http://www.efinancialcareers.com/search?page=1XXXXXXXXXXXXXXX
and for page 2:
http://www.efinancialcareers.com/search?page=2XXXXXXXXXXXXXXX
This means you should request URL = http://www.efinancialcareers.com/search?page={NUMBER_OF_PAGE}XXXXXXXXXXXXXXX
with NUMBER_OF_PAGE running from 1 to LAST_PAGE. So, instead of hardcoding the URL, you can simply create a loop and generate the URLs, as shown in the sketch below.
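Putting the pieces together, here is a Python 3 sketch that loops over the pages and writes rows with the csv module you already import. BASE_URL is shortened here and LAST_PAGE is a placeholder; you would plug in your full query string and the real page count, and the class names are assumed to still match the site's markup:

import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://www.efinancialcareers.com/search?page={}&sortBy=POSTED_DESC'  # shortened; use your full query string
LAST_PAGE = 5  # placeholder - read the real count from the site's pagination controls

with open('EFINCAR.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['Job name', 'Salary', 'Location', 'Position', 'Company', 'Date'])
    for page in range(1, LAST_PAGE + 1):
        soup = BeautifulSoup(requests.get(BASE_URL.format(page)).content, 'lxml')
        for container in soup.find_all('div', {'class': 'jobListContainer'}):
            for preview in container.find_all('li', {'class': 'jobPreview well'}):
                h3 = preview.find('h3')
                job = h3.find('a') if h3 else None
                details = preview.find('ul', {'class': 'details'})
                row = [job.get_text().strip() if job else '']
                # salary/location/position/company/updated mirror the classes used above
                for cls in ('salary', 'location', 'position', 'company', 'updated'):
                    item = details.find('li', {'class': cls}) if details else None
                    row.append(' '.join(item.get_text().split()) if item else '')
                writer.writerow(row)  # one job per CSV row, fields separated by ';'

Letting csv.writer handle the delimiters also spares you the manual f.write(';') bookkeeping from the original script.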