My goal is to extract only the text of each job description from LinkedIn. So far my code is crude, and it also pulls junk text out of the surrounding HTML. One solution would be to use an XPath or CSS selector, but since I'm relatively new to this, I haven't been able to write the right code for it. In case it helps, I'm posting the XPath and CSS selector that point at the text.
Scrapy doesn't seem to run on my machine; from what I've read, Scrapy v1.6 still doesn't work on Windows, which is a real shame. If I can at least find a working approach with .find_all() or .select(), I'd be very grateful. The CSS selector and XPath are:
css = '#job-details > span'
xpath = '//*[@id="job-details"]/span'
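For reference, BeautifulSoup can consume that CSS selector directly via `.select()` / `.select_one()`, so no switch to Scrapy is needed. A minimal sketch below, using a hypothetical HTML snippet that mirrors the structure the selector points at (the real LinkedIn markup may differ, especially for non-logged-in requests):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page the selector targets
html = '<div id="job-details"><span>We build data pipelines.</span></div>'

soup = BeautifulSoup(html, 'html.parser')

# .select_one() takes the same CSS selector the browser dev tools show
span = soup.select_one('#job-details > span')
print(span.get_text(strip=True))  # We build data pipelines.
```

If the XPath form is preferred instead, the `lxml` library's `tree.xpath('//*[@id="job-details"]/span')` accepts it as-is, but BeautifulSoup itself only understands CSS selectors.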
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re

url = 'https://www.linkedin.com/jobs/search/?f_E=2&keywords=the%20data%20science'

# Getting our raw html
with requests.Session() as s:
    response = s.get(url)
html = response.content
print(html)

soup = BeautifulSoup(html, 'html.parser')
linkedin = soup.prettify()

# to post all the links contained in the html codes
for link in soup.find_all('a'):
    print(link.get('href'))

# Here we filter all the links that are job postings
links = soup.find_all('a', href=re.compile('https://www.linkedin.com/jobs/'))

links_text = []
for a in links:
    with requests.Session() as s:
        response = s.get(a.get('href'))
    html_ = response.content
    soup_ = BeautifulSoup(html_, 'html.parser')
    links_text.append(soup_.get_text())

# Getting the raw text dataframe with the job links.
links_text = np.asarray(links_text)