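I am scraping this website. I have a script that scrapes the sentences containing the relevant information. What I want to do now is extract three fields from each scraped sentence: the company, the position, and the location.
Job listings that do not have all three required fields will be discarded.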
from bs4 import BeautifulSoup
import requests

# scrape the given website
url = "https://news.ycombinator.com/jobs"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
table = content.find("table", attrs={"class": "itemlist"})

array = []
# now store the required data in an array
for elem in table.find_all('tr', attrs={'class': 'athing'}):
    # elem_id was never defined in the original snippet; this assumes
    # the listing id is the row's own id attribute
    array.append({'id': elem.get('id'),
                  'listing': elem.find('a',
                                       attrs={'class': 'storylink'}).text})
Answer 0 (score: 1)
Most of the job listings seem to follow this pattern:
ZeroCater (YC W11) Is Hiring a Principal Engineer in SF
^^^^^^^^^^^^^^^^^^ --------- ^^^^^^^^^^^^^^^^^^^^ -- ^^
Company                      Position                Location
You can split the job title on "is hiring" and on "in".
import re

import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com/jobs')
soup = BeautifulSoup(r.text, 'html.parser')

# collect the text of every title cell
job_titles = []
for td in soup.find_all('td', {'class': 'title'}):
    job_titles.append(td.text)

# split each title on " is hiring " and " in " (case-insensitive)
split_regex = re.compile(r'\sis hiring\s|\sin\s', re.IGNORECASE)
job_titles_lists = [split_regex.split(title) for title in job_titles]

# keep only the titles that split into exactly three parts
valid_jobs = [parts for parts in job_titles_lists if len(parts) == 3]

# print the output
for parts in valid_jobs:
    for item, value in zip(['Company', 'Position', 'Location'], parts):
        print(item + ':' + value)
    print('\n')
Output
Company:Flexport
Position:software engineers
Location:Chicago and San Francisco (flexport.com)
Company:OneSignal
Position:a DevOps Engineer
Location:San Mateo (onesignal.com)
...
Note
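In the output above, each location still ends with the site name in parentheses, for example "(flexport.com)". A minimal sketch of one way to separate it from the location follows. The names SITE_SUFFIX_RE and strip_site_suffix are hypothetical and not part of the answer's code, and the pattern assumes the suffix is a single parenthesised token containing a dot at the very end of the string.

```python
import re

# Assumption: the site suffix looks like " (example.com)" at the end of the text.
SITE_SUFFIX_RE = re.compile(r"\s*\(([^()\s]+\.[^()\s]+)\)\s*$")


def strip_site_suffix(text):
    """Return (text_without_suffix, site), with site set to None if absent."""
    match = SITE_SUFFIX_RE.search(text)
    if match is None:
        return text.strip(), None
    return text[:match.start()].strip(), match.group(1)


if __name__ == "__main__":
    print(strip_site_suffix("Chicago and San Francisco (flexport.com)"))
    print(strip_site_suffix("San Mateo (onesignal.com)"))
```

Running this on each title before splitting, or on the location afterwards, should leave a parenthesised batch tag such as "(YC W11)" alone, since that tag contains a space and no dot.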
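Answer 1 (score: 0)
I would use something less specific than Bitto's answer, because if you only look for the regex "is hiring", you will miss every listing phrased as "is looking" or "is seeking". The general pattern is: [company] is [verb] [position] in [location]. Based on that, if you split the sentence into a list of words, you only need to find the indexes of "is" and "in", and then take the words before "is", the words between "is" and "in", and the words after "in". Like this: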
def split_str(sentence):
    # lowercase so that 'Is' and 'is' are treated alike, then split into words
    words = sentence.lower().split(' ')
    # a sentence without 'is' or 'in' cannot supply all three fields
    if 'is' not in words or 'in' not in words:
        return ('None', 'None', 'None')
    where_is = words.index('is')
    where_in = words.index('in')
    name_company = ' '.join(words[:where_is])
    # skip 'is' and the verb that follows it (e.g. 'hiring')
    position = ' '.join(words[where_is + 2:where_in])
    location = ' '.join(words[where_in + 1:])
    ans = (name_company, position, location)
    # invalidate the whole result if any field is empty
    if not all(ans):
        return ('None', 'None', 'None')
    return ans

# not a valid input because it does not have a position
some_sentence1 = 'Streak CRM for Gmail (YC S11) Is Hiring in Vancouver'
# valid because it has company, position, location
some_sentence = 'Flexport is hiring software engineers in Chicago and San Francisco'
print(split_str(some_sentence))
print(split_str(some_sentence1))
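I added a checker that simply determines whether any value is missing, and then either invalidates the whole result with ('None', 'None', 'None') or returns all the values.
Output: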
('flexport', 'software engineers', 'chicago and san francisco')
('None', 'None', 'None')
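This is just an idea, and it is not perfect either: "[company] is looking to hire [position] in [location]" would give you back (company, "to hire [position]", location)... You could clean that up by looking into the NLTK module and using it to filter for nouns and the like.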
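As a lighter alternative to NLTK, the limitation described above can also be reduced with a single regular expression that lists the hiring phrases explicitly. This is only a sketch and not the answer's method: HIRING_VERBS, LISTING_RE and parse_listing are hypothetical names, and the list of phrases is an assumption that would need extending as new wordings turn up.

```python
import re

# Assumption: these are the hiring phrases worth recognising; longer phrases
# come first so "looking to hire" is not cut short.
HIRING_VERBS = r"(?:hiring|looking\s+to\s+hire|looking\s+for|seeking)"

LISTING_RE = re.compile(
    r"^(?P<company>.+?)\s+is\s+" + HIRING_VERBS
    + r"\s+(?P<position>.+?)\s+in\s+(?P<location>.+)$",
    re.IGNORECASE,
)


def parse_listing(title):
    """Return (company, position, location), or None if a field is missing."""
    match = LISTING_RE.match(title.strip())
    if match is None:
        return None
    return (match.group("company"),
            match.group("position"),
            match.group("location"))


if __name__ == "__main__":
    titles = [
        "Flexport is hiring software engineers in Chicago and San Francisco",
        "Streak CRM for Gmail (YC S11) Is Hiring in Vancouver",
        "Acme is looking to hire a designer in Berlin",
    ]
    # Listings that do not yield all three fields are discarded.
    for title in titles:
        parsed = parse_listing(title)
        if parsed is not None:
            print(parsed)
```

Unlike split_str, this keeps the original capitalisation and returns None rather than a tuple of 'None' strings, so invalid listings can be dropped with a plain "is not None" check.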