所以我想要做的是从链接中的特定标签中获取文本,我想要做的只是在文本包含某些单词时才返回html:例如:text包含" chemical&# 34;然后返回该链接,如果没有通过
这是我的代码:
import requests
from bs4 import BeautifulSoup
import webbrowser
jobsearch = input("What type of job?: ")
location = input("What is your location: ")
url = ("https://ca.indeed.com/jobs?q=" + jobsearch + "&l=" + location)
base_url = 'https://ca.indeed.com/'
r = requests.get(url)
rcontent = r.content
prettify = BeautifulSoup(rcontent, "html.parser")
all_job_url = []
def get_all_joblinks():
for tag in prettify.find_all('a', {'data-tn-element':"jobTitle"}):
link = tag['href']
all_job_url.append(link)
def filter_links():
for eachurl in all_job_url:
rurl = requests.get(base_url + eachurl)
content = rurl.content
soup = BeautifulSoup(content, "html.parser")
summary = soup.find('td', {'class':'snip'}).get_text()
print(summary)
def search_job():
while True:
if prettify.select('div.no_results'):
print("no job matches found")
break
else:
# opens the web page of job search if entries are found
website = webbrowser.open_new(url);
break
get_all_joblinks()
filter_links()
答案 0 :(得分:0)
您似乎从get_all_joblinks
函数中的单个really.ca页面获取了所有链接。以下是如何检查典型链接是否在其body
元素的文本中的某处提及“化学”。
>>> import requests
>>> import bs4
>>> page = requests.get('https://jobs.sanofi.us/job/-/-/507/4895612?utm_source=indeed.com&utm_campaign=sanofi%20sem%20campaign&utm_medium=job_aggregator&utm_content=paid_search&ss=paid').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> body = soup.find('body').text
>>> chemical_present = body.lower().find('chemical')>-1
>>> chemical_present
True
我希望这就是你要找的东西。
编辑,以回应评论。
>>> import webbrowser
>>> job_type = 'engineer'
>>> location = 'Toronto'
>>> url = "https://ca.indeed.com/jobs?q=" + job_type + "&l=" + location
>>> base_url = '%s://%s' % parse.urlparse(url)[0:2]
>>> page = requests.get(url).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for link in soup.find_all('a', {'data-tn-element':"jobTitle"}):
... job_page = requests.get(base_url+link['href']).content
... job_soup = bs4.BeautifulSoup(job_page, 'lxml')
... body = job_soup.find('body').text
... if body.lower().find('chemical')>-1:
... webbrowser.open(base_url+link['href'])