Question

我是scrapy和XPath的新手，但在Python中编程已有一段时间了。我想使用scrapy从页面email获取name of the person making the offer，phone和https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/号码。如您所见，电子邮件和电话在<p>标记内以文本形式提供，这使得难以提取。

我的想法是首先在Job Overview内部或至少所有文本中讨论相关工作，然后使用ReGex获取email，phone number和如果可能name of the person。

所以，我使用命令scrapy shell启动scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/并从那里获取response。

现在，我尝试从div job_description获取所有文本，而实际上我什么都没得到。我用了

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

返回[u'\t\t\t\n\t\t ']

如何从上述页面获取所有文字？显然，之后将完成任务以获取之前提到的属性，但首先要做的事情。

更新：此选择仅返回[] response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()

Answer 1

你很接近

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

除了你得到的内容之外，div - 标签实际上没有任何文字。

<div class="job_description" (...)>
    "This is the text you are getting"
    <p>"This is the text you want"</p>
</div>

如您所见，response.xpath('//div[@class="job_description"]/text()').extract()所带来的文字是{strong>介于 div - 标签之间的文字，而不是div内的标签之间的文字。 1}} - 标签。为此你需要：

response.xpath('//div[@class="job_description"]//*/text()').extract()

这样做是从div[@class="job_description]中选择所有子节点并返回文本（请参阅here了解不同的xpath所做的事情）。

你会看到这会返回很多无用的文本，因为你仍然得到所有\n等等。为此，我建议您将xpath缩小到您想要的元素，而不是采用广泛的方法。

例如，整个职位描述将在

中

response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()

如何使用scrapy获取工作描述？

1 个答案: