Question

I need to scrape the following data:

Name of the company that is hiring
Location of the company
Position the ad is for

This is the website I want to scrape: link. I am able to get the td data, but I need to start from a specific td tag (i.e., from this tr tag)

<tr style="height:14px"></tr>
        <tr class='athing' id='20463814'>
  <td align="right" valign="top" class="title"><span class="rank"></span></td>      <td></td><td class="title"><a href="https://mino-games.workable.com/j/69BCF95C8F" class="storylink" rel="nofollow">Mino Games (YC W11) Is Hiring Game Developers in Montreal</a><span class="sitebit comhead"> (<a href="from?site=workable.com"><span class="sitestr">workable.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
    <span class="age"><a href="item?id=20463814">11 hours ago</a></span>      </td></tr>

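and then continue with the other tags, while storing the company name, location, and position in separate variables. I know it's a lot to ask, but I would appreciate any help you can provide.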

This is what I have tried:

import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'

plain_html_text = requests.get(url);

soup = BeautifulSoup(plain_html_text.text, "html.parser")

table_body = soup.find('tbody')
rows = soup.find('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print (cols)

Answer 1

What you want is not a simple problem, but this script can help you get started:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'

plain_html_text = requests.get(url)

soup = BeautifulSoup(plain_html_text.text, "html.parser")

rows = []
for title in soup.select('.title:not(:has(.morelink)) .storylink'):
    t = title.get_text(strip=True)

    company = re.findall(r'^(.*?)(?:is hiring|is looking|seeking|hiring)', t, flags=re.I)
    if company:
        company = company[0].strip()
    else:
        company = '-'

    position = re.findall(r'(?:is hiring|is looking|seeking|hiring)(.*?)(?=\bin\b|$)', t, flags=re.I)
    if position:
        position = position[0].strip()
    else:
        position = '-'

    location = re.findall(r'(?:\bin\b)(.*)', t, flags=re.I)
    if location:
        location = location[0].strip()
    else:
        location = '-'

    rows.append([company, position, location])

print('{: ^50}{: ^80}{: ^20}'.format('Company', 'Position', 'Location'))
for row in rows:
    c, p, l = row
    print('{: <50}{: <80}{: <20}'.format(c, p, l))

Prints:

                     Company                                                          Position                                          Location      
Scale AI                                          engineers to accelerate the development of AI                                   -                   
Mino Games (YC W11)                               Game Developers                                                                 Montreal            
BuildZoom (YC W13)                                – Help us un-break construction                                                 -                   
Bitmovin (YC S15)                                 a Video Solutions Architect/Software Engineer                                   Brazil              
Streak – CRM for Gmail (YC S11)                                                                                                   Vancouver           
ZeroCater (YC W11)                                a Director of Engineer                                                          SF                  
UpCodes (YC S17)                                  engineers to automate compliance for architects                                 -                   
Tech Nonprofit Upsolve (YC W19)                   a Software Engineer                                                             -                   
Gitlab (YC W15)                                   an Engineering Manager, Ecosystem                                               -                   
Saleswhale (YC S16)                               Our First U.S. Strategic Account Executive                                      -                   
Jerry (YC S17)                                    for a Director of Ops and Growth                                                -                   
Sourceress (YC S17)                               Product and ML Engineers (Remote OK, No Prior ML OK)                            -                   
GiveCampus (YC S15)                               a Product Designer who cares about education                                    -                   
Iris Automation                                   an Account Executive for B2B Flying Vehicle Software                            -                   
LogDNA (YC W15)                                   Software Engineers – DevOps Monitoring at Scale                                 -                   
Flexport                                          software engineers to work on our trucking apps                                 Chicago             
Mux                                               an ML engineer to help train our machines to deliver better video               -                   
The Muse (YC W12)                                 a Product Director for Growth                                                   -                   
OneSignal                                         an SRE to scale our bare-metal infrastructure                                   -                   
Atomwise (YC W15)                                 a Senior Systems/Cloud Engineer                                                 -                   
Demodesk (YC W19)                                 Software Engineers                                                              Munich              
Gusto                                             for Android and iOS developers to build our native mobile app                   -                   
Fond (YC W12)                                     an Engineering Manager                                                          Portland            
ReadMe (YC W15)                                   – Help us make APIs easy to use                                                 -                   
Keeper (YC W19)                                   a lead engineer – help save gig workers money on taxes                          -                   
Asseta (YC S13)                                   a technical lead                                                                -                   
Tesorio (YC S15)                                  Engineering Managers, Senior Engineers                                          -                   
Standard Cognition (YC S17)                       – Work on vision systems                                                        Rust                
Curebase (YC S18)                                 first sales hire – distributed clinical research                                -                   
Mashgin (YC W15)                                  a Fullstack SWE Interested                                                      Computer Vision/AI
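
The table shows where the regex heuristics break down: any standalone "in" is treated as the start of the location, which is how "Rust" and "Computer Vision/AI" end up in the Location column. As a minimal sketch for experimenting offline, the three patterns from the script above can be wrapped in a helper and run on sample titles. The names split_title and HIRING are hypothetical, and the "Example Co" titles are made up for illustration; only the Mino Games title comes from the page.

```python
import re

# Same keyword alternation as in the script above.
HIRING = r'(?:is hiring|is looking|seeking|hiring)'


def split_title(title):
    """Split a job title into (company, position, location); '-' marks a missing part."""
    def first(pattern):
        found = re.findall(pattern, title, flags=re.I)
        return found[0].strip() if found else '-'

    company = first(r'^(.*?)' + HIRING)               # text before the hiring keyword
    position = first(HIRING + r'(.*?)(?=\bin\b|$)')   # text up to a standalone "in" or the end
    location = first(r'(?:\bin\b)(.*)')               # text after the first standalone "in"
    return company, position, location


print(split_title('Mino Games (YC W11) Is Hiring Game Developers in Montreal'))
# ('Mino Games (YC W11)', 'Game Developers', 'Montreal')

# Made-up title: "in Rust" describes the job, yet it is reported as the location.
print(split_title('Example Co is hiring engineers interested in Rust'))
# ('Example Co', 'engineers interested', 'Rust')
```

Running the helper on a handful of real titles before scraping makes it easier to see which phrasings need an extra rule.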

Answer 2

This is a basic scraper that splits the title into company and position.

import requests
from bs4 import BeautifulSoup
import re

from pprint import pprint

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'})
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def extract_jobs(soup: BeautifulSoup) -> list:
    titles = soup.select('.storylink')
    # Raw string so that \s is passed to the regex engine without an invalid-escape warning.
    hiring_re = re.compile(r'\s+(is)?\s+(hiring|seeking|looking)\s+(for)?', flags=re.IGNORECASE)

    jobs = []
    for el in titles:
        title = el.text.strip()
        m = hiring_re.search(title)
        if not m:
            continue
        company = title[:m.start()].strip()
        offer = title[m.end():].strip().title()
        jobs.append({
            'company': company,
            'wants': offer,
        })
    return jobs


url = 'https://news.ycombinator.com/jobs'
soup = make_soup(url)
jobs = extract_jobs(soup)
pprint(jobs)

Output:

 {'company': 'Mino Games (YC W11)', 'wants': 'Game Developers In Montreal'},
 {'company': 'BuildZoom (YC W13)', 'wants': '– Help Us Un-Break Construction'},
 {'company': 'Streak – CRM for Gmail (YC S11)', 'wants': 'In Vancouver'},
 {'company': 'ZeroCater (YC W11)', 'wants': 'A Director Of Engineer In Sf'},
 {'company': 'UpCodes (YC S17)', 'wants': 'Engineers To Automate Compliance For Architects'},
 {'company': 'Tech Nonprofit Upsolve (YC W19)', 'wants': 'A Software Engineer'},
...
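
On the part of the question about starting from a specific tr tag: find() returns only the first match, whereas select() and find_all() return every match. A minimal sketch, assuming each job sits in a tr with class "athing" as in the snippet from the question, could select those rows directly. The markup below is copied from the question and wrapped in a table element so it runs without a network request; the jobs list and its keys are hypothetical names.

```python
from bs4 import BeautifulSoup

# Markup from the question: a spacer row, the job row (class "athing"), and the subtext row.
html = """
<table>
<tr style="height:14px"></tr>
<tr class='athing' id='20463814'>
  <td align="right" valign="top" class="title"><span class="rank"></span></td>
  <td></td>
  <td class="title"><a href="https://mino-games.workable.com/j/69BCF95C8F" class="storylink" rel="nofollow">Mino Games (YC W11) Is Hiring Game Developers in Montreal</a></td>
</tr>
<tr><td colspan="2"></td><td class="subtext"><span class="age"><a href="item?id=20463814">11 hours ago</a></span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

jobs = []
# Only rows with class "athing" match, so the spacer and subtext rows are skipped.
for row in soup.select("tr.athing"):
    link = row.select_one("a.storylink")
    if link is None:
        continue  # row without a title link
    jobs.append({
        "id": row.get("id"),
        "title": link.get_text(strip=True),
        "url": link.get("href"),
    })

print(jobs)
```

Each title collected this way can then be split into company and position with the regex above.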

Scraping specific data from td tags
