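I am trying to scrape my first website (https://news.ycombinator.com/jobs) using Scrapy and Python. The information I need to extract is:
- the name of the hiring company
- the company's location
- the position being advertised
These fields do not have separate tags in the HTML page, and the text does not follow a fixed pattern. For example: ZeroCater (YC W11) Is Hiring a Principal Engineer in SF: Must Love Food
Regex alone does not seem to be enough to extract this information. Is there an effective and simple solution to this problem?
I have tried Python regex. I also looked into NLP and text classification with nltk, but nltk adds complexity to the code and is time-consuming.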
Answer 0 (score: 2)
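In this case I would look for any pattern that helps me extract the data.
For example, I can see that the phrases "is hiring|is looking for|is looking to hire|hiring" are common. The company name comes before them, and the location, when present, comes after "in".
This is just a small trial that you can extend to get what you need: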
import re
text = """ZeroCater (YC W11) Is Hiring a Principal Engineer in SF: Must Love Food (zerocater.com)
OneSignal Is Hiring Full Stack Engineers in San Mateo (onesignal.com)
Faire (YC W17) Is Looking to Hire Business Operations Leads (greenhouse.io)
InsideSherpa (YC W19) Is Hiring Software Engineers in Sydney (workable.com)
Jerry (YC S17) Is Hiring Senior Software Dev, Data Engineer (Toronto/Remote) (getjerry.com)
Iris Automation Is Hiring an Account Executive for B2B Flying Vehicle Software (irisonboard.com)"""
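# lowercase everything so the keyword patterns match regardless of capitalization
data = text.lower().splitlines()
for i, line in enumerate(data):
    # company name: the text before the hiring phrase (strip the leftover spaces)
    parts = re.split(r'is hiring|is looking for|is looking to hire|hiring', line)
    data[i] = [part.strip() for part in parts]
    # job title, plus the location if present (the text after " in ")
    data[i][1] = re.split(r' in ', data[i][1])
print('company --- Job Title --- Location')
for c in data:
    print(f'{c[0]} --- {c[1][0]} --- {c[1][1] if len(c[1]) > 1 else ""}')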
Output:
company --- Job Title --- Location
zerocater (yc w11) --- a principal engineer --- sf: must love food (zerocater.com)
onesignal --- full stack engineers --- san mateo (onesignal.com)
faire (yc w17) --- business operations leads (greenhouse.io) ---
insidesherpa (yc w19) --- software engineers --- sydney (workable.com)
jerry (yc s17) --- senior software dev, data engineer (toronto/remote) (getjerry.com) ---
iris automation --- an account executive for b2b flying vehicle software (irisonboard.com) ---
To be sure, this code needs a lot of modification before it gives reliable results, but it is at least a start.
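As one possible way to extend the trial above, here is a minimal sketch that also removes the trailing site name such as "(zerocater.com)" and cuts the tagline after the colon from the location. The function name parse_title and the two compiled patterns are hypothetical names made up for this illustration, and the rules are assumptions based only on the six sample titles, so real listings will likely need more cases (for example a location given in parentheses, as in the Jerry title).

```python
import re

# Hypothetical helper for illustration; the patterns are assumptions
# drawn from the six sample titles, not a complete grammar.
HIRING = re.compile(
    r"\s+(?:is hiring|is looking to hire|is looking for|hiring)\s+",
    re.IGNORECASE,
)
# A trailing "(domain.tld)" such as "(zerocater.com)" at the end of a title
TRAILING_DOMAIN = re.compile(r"\s*\((?:[\w-]+\.)+[a-z]{2,}\)\s*$", re.IGNORECASE)


def parse_title(title):
    """Split a job title into company, job and location.

    Returns None when no hiring phrase is found; location is an
    empty string when the title has no " in " part.
    """
    title = TRAILING_DOMAIN.sub("", title)
    # maxsplit=1 keeps everything after the first hiring phrase together
    parts = HIRING.split(title, maxsplit=1)
    if len(parts) != 2:
        return None
    company, rest = parts[0].strip(), parts[1].strip()
    # Text before the first " in " is the job, text after it the location
    job, _, location = rest.partition(" in ")
    # Drop any tagline after a colon, e.g. "SF: Must Love Food" -> "SF"
    location = location.split(":", 1)[0].strip()
    return {"company": company, "job": job.strip(), "location": location}
```

Calling parse_title on each title scraped from the page gives one dictionary per listing. Unlike the trial above, it keeps the original capitalization because the pattern is compiled with re.IGNORECASE instead of lowercasing the text.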