How can I extract specific text from job descriptions with BeautifulSoup?

Time: 2019-06-24 13:18:22

Tags: python beautifulsoup

My goal is to extract only the text of each job description from LinkedIn. So far the code is still rough and also pulls junk text out of the HTML. One solution would be to use an XPath or CSS selector, but since I'm fairly new to this I haven't been able to work out the right code. In case it helps, I'm posting the XPath and CSS selectors that point to the text.

Scrapy doesn't seem to work on my machine, because from what I've read Scrapy v1.6 still doesn't run on Windows, which is a real shame. I'd be grateful for any way to do this with .find_all() or .select(). The CSS selector and XPath are:

    css = '#job-details > span'
    xpath = '//*[@id="job-details"]/span'
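
For reference, this is the kind of .select() call I had in mind. It is only a minimal sketch: the job URL below is a placeholder, and LinkedIn renders much of a job page with JavaScript, so the #job-details element may not even be in the HTML that requests returns:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL for a single job posting (not a real posting)
    job_url = 'https://www.linkedin.com/jobs/view/123456789'

    response = requests.get(job_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # .select() takes CSS selectors directly, so the selector above can be reused as-is;
    # BeautifulSoup itself does not evaluate XPath, that would need lxml instead.
    for span in soup.select('#job-details > span'):
        print(span.get_text(strip=True))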


    import requests
    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup
    import re

    url = 'https://www.linkedin.com/jobs/search/?f_E=2&keywords=the%20data%20science'

    # Getting our raw html

    with requests.Session() as s:
        response = s.get(url)

    html = response.content
    print(html)

    soup = BeautifulSoup(html, 'html.parser')
    linkedin = soup.prettify()

    # Print all the links contained in the HTML


    for link in soup.find_all('a'):
        print(link.get('href'))

    # Here we filter all the links that are job postings

    links = soup.find_all('a', href=re.compile(r'https://www\.linkedin\.com/jobs/'))

    links_text = []

    with requests.Session() as s:
        for a in links:
            response = s.get(a.get('href'))
            html_ = response.content
            soup_ = BeautifulSoup(html_, 'html.parser')
            links_text.append(soup_.get_text())

    # Collect the raw text of each job page into a NumPy array

    links_text = np.asarray(links_text)
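
Ideally I would swap the get_text() call on the whole page for the selector, so only the description text is kept. A rough sketch of that idea, with the same caveat that #job-details may not be present in the HTML requests gets back:

    descriptions = []

    with requests.Session() as s:
        for a in links:
            response = s.get(a.get('href'))
            soup_ = BeautifulSoup(response.content, 'html.parser')
            # Keep only the spans inside the job-details container, if they were served
            spans = soup_.select('#job-details > span')
            descriptions.append(' '.join(span.get_text(strip=True) for span in spans))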

0 Answers:

No answers