Question

I want to scrape the job description from the following page: https://www.aha.io/company/careers/current-openings/customer_success_specialist_project_management_us

I want to get all of the text and HTML inside the div with the class "container py2 content job", except for the button. The button is an <a> tag with the class "btn btn-large btn-secondary".

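I have two XPath selectors that I think should work, but neither does. The first does not exclude the button, and the second does not include the rest of the HTML that I want to keep.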

response.xpath('//div[@class ="container py2 content job"][not(parent::a/@class="btn btn-large btn-secondary")]').extract()

response.xpath('//div[@class ="container py2 content job"]/descendant::text()[not(parent::a/@class="btn btn-large btn-secondary")]').extract()

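What I am after is all of the HTML inside the div, minus what is inside the <a> tag. I hope I am missing something simple, but I can't find what I need in the documentation.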

Answer 1

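# Serialize every element nested under a div whose class list includes "content".
job_html = response.css('div.content *').extract()
# Drop any element whose HTML contains the text "Apply now" (presumably the button's label).
job_html = [x for x in job_html if "Apply now" not in x]
print(job_html)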
