My goal is to extract only the text of each job description from LinkedIn. So far my code is crude, and it also pulls junk text out of the surrounding HTML. One solution would be to use an XPath or CSS selector, but since I'm relatively new to this, I haven't been able to write the right code for it. In case it helps, I'm posting the XPath and CSS selector that point at the text.
Scrapy doesn't seem to run on my machine; from what I've read, Scrapy v1.6 still doesn't work on Windows, which is a real shame. If I can at least find a working approach with .find_all() or .select(), I'd be very grateful. The CSS selector and XPath are:
css = '#job-details > span'
xpath = '//*[@id="job-details"]/span'
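For reference, BeautifulSoup can consume that CSS selector directly via `.select()` / `.select_one()`, so no switch to Scrapy is needed. A minimal sketch below, using a hypothetical HTML snippet that mirrors the structure the selector points at (the real LinkedIn markup may differ, especially for non-logged-in requests):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page the selector targets
html = '<div id="job-details"><span>We build data pipelines.</span></div>'

soup = BeautifulSoup(html, 'html.parser')

# .select_one() takes the same CSS selector the browser dev tools show
span = soup.select_one('#job-details > span')
print(span.get_text(strip=True))  # We build data pipelines.
```

If the XPath form is preferred instead, the `lxml` library's `tree.xpath('//*[@id="job-details"]/span')` accepts it as-is, but BeautifulSoup itself only understands CSS selectors.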
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re

url = 'https://www.linkedin.com/jobs/search/?f_E=2&keywords=the%20data%20science'

# Getting our raw html
with requests.Session() as s:
    response = s.get(url)
html = response.content
print(html)

soup = BeautifulSoup(html, 'html.parser')
linkedin = soup.prettify()

# to post all the links contained in the html codes
for link in soup.find_all('a'):
    print(link.get('href'))

# Here we filter all the links that are job postings
links = soup.find_all('a', href=re.compile('https://www.linkedin.com/jobs/'))

links_text = []
for a in links:
    with requests.Session() as s:
        response = s.get(a.get('href'))
    html_ = response.content
    soup_ = BeautifulSoup(html_, 'html.parser')
    links_text.append(soup_.get_text())

# Getting the raw text dataframe with the job links.
links_text = np.asarray(links_text)