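I need to scrape this data
This is the website I want to scrape (link). I am able to get the td data, but I need to start from a specific tag, namely this tr tag: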
<tr style="height:14px"></tr>
<tr class='athing' id='20463814'>
<td align="right" valign="top" class="title"><span class="rank"></span></td> <td></td><td class="title"><a href="https://mino-games.workable.com/j/69BCF95C8F" class="storylink" rel="nofollow">Mino Games (YC W11) Is Hiring Game Developers in Montreal</a><span class="sitebit comhead"> (<a href="from?site=workable.com"><span class="sitestr">workable.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
<span class="age"><a href="item?id=20463814">11 hours ago</a></span> </td></tr>
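and then continue through the other tags, collecting the company name, position, and location in separate variables. I know this is a lot to ask, but I would appreciate any help you can offer.
Here is what I have tried: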
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'
plain_html_text = requests.get(url)
soup = BeautifulSoup(plain_html_text.text, "html.parser")
table_body = soup.find('tbody')
rows = soup.find('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
Answer 0 (score: 0)
What you are asking for is not a simple problem, but this script can get you started:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'
plain_html_text = requests.get(url)
soup = BeautifulSoup(plain_html_text.text, "html.parser")

rows = []
# Every job title link, skipping the cell that holds the "More" link
for title in soup.select('.title:not(:has(.morelink)) .storylink'):
    t = title.get_text(strip=True)

    # Company: everything before the hiring phrase
    company = re.findall(r'^(.*?)(?:is hiring|is looking|seeking|hiring)', t, flags=re.I)
    if company:
        company = company[0].strip()
    else:
        company = '-'

    # Position: text after the hiring phrase, up to a standalone "in" or the end
    position = re.findall(r'(?:is hiring|is looking|seeking|hiring)(.*?)(?=\bin\b|$)', t, flags=re.I)
    if position:
        position = position[0].strip()
    else:
        position = '-'

    # Location: everything after the first standalone "in"
    location = re.findall(r'(?:\bin\b)(.*)', t, flags=re.I)
    if location:
        location = location[0].strip()
    else:
        location = '-'

    rows.append([company, position, location])

# Centered header, then one left-aligned line per job
print('{: ^50}{: ^80}{: ^20}'.format('Company', 'Position', 'Location'))
for row in rows:
    c, p, l = row
    print('{: <50}{: <80}{: <20}'.format(c, p, l))
Prints:
                     Company                                                          Position                                          Location
Scale AI                                          engineers to accelerate the development of AI                                   -
Mino Games (YC W11)                               Game Developers                                                                 Montreal
BuildZoom (YC W13)                                – Help us un-break construction                                                 -
Bitmovin (YC S15)                                 a Video Solutions Architect/Software Engineer                                   Brazil
Streak – CRM for Gmail (YC S11)                                                                                                   Vancouver
ZeroCater (YC W11)                                a Director of Engineer                                                          SF
UpCodes (YC S17)                                  engineers to automate compliance for architects                                 -
Tech Nonprofit Upsolve (YC W19)                   a Software Engineer                                                             -
Gitlab (YC W15)                                   an Engineering Manager, Ecosystem                                               -
Saleswhale (YC S16)                               Our First U.S. Strategic Account Executive                                      -
Jerry (YC S17)                                    for a Director of Ops and Growth                                                -
Sourceress (YC S17)                               Product and ML Engineers (Remote OK, No Prior ML OK)                            -
GiveCampus (YC S15)                               a Product Designer who cares about education                                    -
Iris Automation                                   an Account Executive for B2B Flying Vehicle Software                            -
LogDNA (YC W15)                                   Software Engineers – DevOps Monitoring at Scale                                 -
Flexport                                          software engineers to work on our trucking apps                                 Chicago
Mux                                               an ML engineer to help train our machines to deliver better video               -
The Muse (YC W12)                                 a Product Director for Growth                                                   -
OneSignal                                         an SRE to scale our bare-metal infrastructure                                   -
Atomwise (YC W15)                                 a Senior Systems/Cloud Engineer                                                 -
Demodesk (YC W19)                                 Software Engineers                                                              Munich
Gusto                                             for Android and iOS developers to build our native mobile app                   -
Fond (YC W12)                                     an Engineering Manager                                                          Portland
ReadMe (YC W15)                                   – Help us make APIs easy to use                                                 -
Keeper (YC W19)                                   a lead engineer – help save gig workers money on taxes                          -
Asseta (YC S13)                                   a technical lead                                                                -
Tesorio (YC S15)                                  Engineering Managers, Senior Engineers                                          -
Standard Cognition (YC S17)                       – Work on vision systems                                                        Rust
Curebase (YC S18)                                 first sales hire – distributed clinical research                                -
Mashgin (YC W15)                                  a Fullstack SWE Interested                                                      Computer Vision/AI
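To see how the three patterns behave without hitting the network, the same logic can be wrapped in a small function and run on individual title strings. This is only a sketch: `split_title`, its inner `first` helper, and the `HIRING` constant are names made up for this illustration, and the second title is a guess at what the original headline looked like.

```python
import re

# Hypothetical constant (not in the script above): the shared hiring phrases.
HIRING = r'(?:is hiring|is looking|seeking|hiring)'


def split_title(title):
    """Return (company, position, location); '-' where a pattern finds nothing."""
    def first(pattern):
        found = re.findall(pattern, title, flags=re.I)
        return found[0].strip() if found else '-'

    company = first(r'^(.*?)' + HIRING)               # text before the hiring phrase
    position = first(HIRING + r'(.*?)(?=\bin\b|$)')   # up to a standalone "in" or the end
    location = first(r'(?:\bin\b)(.*)')               # everything after the first "in"
    return company, position, location


print(split_title('Mino Games (YC W11) Is Hiring Game Developers in Montreal'))
print(split_title('Tech Nonprofit Upsolve (YC W19) Is Hiring a Software Engineer'))
```

As the Standard Cognition and Mashgin rows above show, any standalone "in" is treated as the start of a location, so the third value is best read as a guess rather than a reliable place name.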
Answer 1 (score: 0)
Here is a basic scraper that splits each title into the company and the position on offer.
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint


def make_soup(url: str) -> BeautifulSoup:
    # Send a browser-like user agent and fail loudly on HTTP errors
    res = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'})
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup


def extract_jobs(soup: BeautifulSoup) -> list:
    titles = soup.select('.storylink')
    # Raw string so that \s reaches the regex engine unchanged
    hiring_re = re.compile(r'\s+(is)?\s+(hiring|seeking|looking)\s+(for)?', flags=re.IGNORECASE)
    jobs = []
    for el in titles:
        title = el.text.strip()
        m = hiring_re.search(title)
        if not m:
            # Titles without a recognizable hiring phrase are skipped
            continue
        company = title[:m.start()].strip()
        offer = title[m.end():].strip().title()
        jobs.append({
            'company': company,
            'wants': offer,
        })
    return jobs


url = 'https://news.ycombinator.com/jobs'
soup = make_soup(url)
jobs = extract_jobs(soup)
pprint(jobs)
Output:
{'company': 'Mino Games (YC W11)', 'wants': 'Game Developers In Montreal'},
{'company': 'BuildZoom (YC W13)', 'wants': '– Help Us Un-Break Construction'},
{'company': 'Streak – CRM for Gmail (YC S11)', 'wants': 'In Vancouver'},
{'company': 'ZeroCater (YC W11)', 'wants': 'A Director Of Engineer In Sf'},
{'company': 'UpCodes (YC S17)', 'wants': 'Engineers To Automate Compliance For Architects'},
{'company': 'Tech Nonprofit Upsolve (YC W19)', 'wants': 'A Software Engineer'},
...
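For the first half of the question (starting at the `tr class='athing'` rows instead of walking every `td`), a CSS selector on the row class may be enough. The sketch below runs on the fragment from the question, wrapped in a `<table>` element, rather than on the live page. The `posts` list and its keys are illustrative names, and it assumes the page still uses the `athing`, `storylink` and `age` classes shown in that fragment.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question; the <table> wrapper is added here.
html = """
<table>
<tr style="height:14px"></tr>
<tr class='athing' id='20463814'>
<td align="right" valign="top" class="title"><span class="rank"></span></td> <td></td><td class="title"><a href="https://mino-games.workable.com/j/69BCF95C8F" class="storylink" rel="nofollow">Mino Games (YC W11) Is Hiring Game Developers in Montreal</a><span class="sitebit comhead"> (<a href="from?site=workable.com"><span class="sitestr">workable.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
<span class="age"><a href="item?id=20463814">11 hours ago</a></span> </td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

posts = []
for row in soup.select('tr.athing'):      # the spacer row has no class, so it is skipped
    link = row.select_one('a.storylink')
    # The posting age sits in the row that follows each 'athing' row.
    sub = row.find_next_sibling('tr')
    age = sub.select_one('.age') if sub else None
    posts.append({
        'id': row['id'],
        'title': link.get_text(strip=True),
        'url': link['href'],
        'age': age.get_text(strip=True) if age else None,
    })

print(posts)
```

Each `title` value can then be passed to whichever title-splitting approach you prefer to get the company and position.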