Question

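I am scraping this website. I have a script that scrapes the sentences containing the relevant information. What I want to do now is extract the following information from the scraped sentences.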

The name of the company that is hiring
The location of the company
The position the ad is for

Job listings that do not have all three required fields will be discarded.

This is my script

from bs4 import BeautifulSoup
import requests

# scrape the given website
url = "https://news.ycombinator.com/jobs"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")

table = content.find("table", attrs={"class": "itemlist"})

array = []
# now store the required data in an array
for elem in table.findAll('tr', attrs={'class': 'athing'}):
    # each listing row carries its item id in the "id" attribute
    array.append({'id': elem.get('id'),
                  'listing': elem.find('a',
                                       attrs={'class': 'storylink'}).text})

Answer 1

Most of the job titles seem to follow this pattern

ZeroCater (YC W11) Is Hiring a Principal Engineer in SF
 ^^^^^             ---------        ^^^^^^        -- ^^
Company                             Position         Location

You can split the job title at "is hiring" and "in".

import requests
from bs4 import BeautifulSoup
import re
r=requests.get('https://news.ycombinator.com/jobs')
soup=BeautifulSoup(r.text,'html.parser')
job_titles=list()
for td in soup.findAll('td',{'class':'title'}):
    job_titles.append(td.text)

# raw string so that \s reaches the regex engine as a whitespace class
split_regex=re.compile(r'\sis hiring\s|\sin\s', re.IGNORECASE)
job_titles_lists=[split_regex.split(title) for title in job_titles]
valid_jobs=[l for l in job_titles_lists if len(l) ==3]

#print the output
for l in valid_jobs:
    for item,value in zip(['Company','Position','Location'],l):
        print(item+':'+value)
    print('\n')

Output

Company:Flexport
Position:software engineers
Location:Chicago and San Francisco (flexport.com)


Company:OneSignal
Position:a DevOps Engineer
Location:San Mateo (onesignal.com)

...

Notes

This is not a perfect solution.
Get permission from the site owner.
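
One visible imperfection in the output above is that the location still carries the site name, for example "San Mateo (onesignal.com)". Here is a minimal sketch of stripping it before splitting, assuming the site name always appears as a trailing parenthesised domain. The names strip_site and TRAILING_SITE are made up for this illustration.

```python
import re

# Matches a trailing parenthesised site name such as " (flexport.com)".
# Assumption: the site name is the last thing in the scraped cell text
# and contains a dot but no spaces.
TRAILING_SITE = re.compile(r'\s*\([^()\s]+\.[^()\s]+\)\s*$')

def strip_site(title):
    """Return the title without a trailing "(domain)" suffix, if any."""
    return TRAILING_SITE.sub('', title).strip()

print(strip_site('OneSignal is hiring a DevOps Engineer in San Mateo (onesignal.com)'))
# A batch tag in the middle of the title, such as "(YC W11)", is left alone.
print(strip_site('ZeroCater (YC W11) Is Hiring a Principal Engineer in SF'))
```

Calling strip_site on each title before split_regex.split would keep the site name out of the Location field.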

Answer 2

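I would go with something less specific than Bitto's answer, because if you only look for the regex "is hiring" you will miss every listing phrased as "is looking" or "is seeking". The general pattern is: [company] is [verb] [position] in [location]. Based on that, if you split the sentence into a list, you only need to find the indexes of 'is' and 'in', and then take the values before 'is', between 'is' and 'in', and after 'in'. Like this: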

def split_str(sentence):
    # lowercase the title and split it into words on single spaces
    words = sentence.lower().split(' ')
    try:
        where_is = words.index('is')
        where_in = words.index('in')
    except ValueError:
        # the title has no "is" or no "in", so it does not fit the pattern
        return ('None', 'None', 'None')
    name_company = ' '.join(words[:where_is])
    # skip "is" and the verb that follows it (e.g. "hiring")
    position = ' '.join(words[where_is+2:where_in])
    location = ' '.join(words[where_in+1:])
    ans = (name_company, position, location)
    # invalidate the whole result if any field is empty
    if not all(ans):
        return ('None', 'None', 'None')
    return ans

#not a valid input because it does not have a position      
some_sentence1 = 'Streak CRM for Gmail (YC S11) Is Hiring in Vancouver'

#valid because it has company, position, location
some_sentence = 'Flexport is hiring software engineers in Chicago and San Francisco'

print(split_str(some_sentence))
print(split_str(some_sentence1))

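I added a check that simply determines whether any value is missing, and then either invalidates the whole result with ('None', 'None', 'None') or returns all the values.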

Output:

('flexport', 'software engineers', 'chicago and san francisco')
('None', 'None', 'None')

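Just an idea. This is not perfect either, because "[company] is looking to hire [position] in [location]" would give you back (company, "to hire [position]", location)... You could clean that up by checking out the NLTK module and using it to filter for nouns and the like.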
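
As a lighter alternative to NLTK (this is a different technique, not the part-of-speech filtering suggested above), the "is looking to hire" case can be handled by listing the hiring phrases explicitly in a regex with named groups. This is only a sketch. HIRING_PHRASES, PATTERN and parse_title are hypothetical names, and the phrase list is an assumption that would need extending as new wordings show up.

```python
import re

# Hypothetical list of hiring phrases; not exhaustive.
HIRING_PHRASES = [
    r'hiring',
    r'looking to hire',
    r'looking for',
    r'seeking',
]

# [company] is [hiring phrase] [position] in [location]
# The position group is greedy, so the split happens at the last " in ".
PATTERN = re.compile(
    r'^(?P<company>.+?)\s+is\s+(?:' + '|'.join(HIRING_PHRASES) + r')\s+'
    r'(?P<position>.+)\s+in\s+(?P<location>.+)$',
    re.IGNORECASE,
)

def parse_title(title):
    """Return (company, position, location), or None if a field is missing."""
    match = PATTERN.match(title.strip())
    if match is None:
        return None
    return match.group('company', 'position', 'location')

titles = [
    'Flexport is hiring software engineers in Chicago and San Francisco',
    'Streak CRM for Gmail (YC S11) Is Hiring in Vancouver',
    'Acme is looking to hire a data scientist in Berlin',
]
# Listings without all three fields are discarded.
parsed = [p for p in map(parse_title, titles) if p is not None]
for company, position, location in parsed:
    print(company, '|', position, '|', location)
```

Unlike split_str, this keeps the original capitalisation and returns None rather than a tuple of 'None' strings, which makes the discard step a plain filter.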

Extracting relevant information from sentences with a web scraper?
