Question

I'm trying to scrape information on all the job listings in Bengaluru.

URL: https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0

The XPath of the parent div I'm interested in:

//div[contains(@class, "jobsearch-SerpJobCard")]

I want to extract company names like this:

<span class="company">
        <a>
              Micro Focus
        </a>
</span>

and ones like this:

<div>
    <span class="company">
        SSG <b>Software</b> Systems</span>

    </div>

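I'm using one generic XPath expression to scrape both kinds of title. I'm having trouble with the second kind because it contains several escape characters such as \n. These show up in my results, and once the results are stripped they become empty strings.

XPath used to extract the titles:

//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]/text()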

Result:

['\n', '\n', '\n', '\n Customer Analytics Human Capital', '\n Advantage Technologies', '\n',
  '\n SQUARE', '\n DART', '\n posmab technologies',
  '\n', '\n Pentagon Technologies', '\n',
  '\n MobileComm, Inc.', '\n IGLOBAL IMPACT ITES PVT.LTD.',
  '\n', '\n']

What can I do to get rid of those extra '\n' characters?

Answer 1

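You can use the XPath normalize-space function for this. Select the span element itself rather than its text() nodes, then apply normalize-space() to each selected element: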

>>> fetch('https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0')
2018-12-15 09:47:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0> (referer: None)
>>> response.xpath('//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]').xpath('normalize-space()').getall()
['Amazon.com', 'Sabre', 'Altisource Labs', 'CGI', 'Allscripts Solutions', 'Shilpin Consulting', 'Access6 technology', 'CGI Group, Inc.', 'Misys Software Solutions India', 'Siemens AG']
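
To see why this works without hitting the live site, here is a rough standard-library sketch that runs against the two markup shapes from the question. The names SNIPPET, normalize_space and company_names are made up for this illustration and are not part of Scrapy. It imitates what normalize-space() does to an element's string value for ordinary spaces, tabs and newlines; it is not how Scrapy implements it.

```python
import xml.etree.ElementTree as ET

# The two markup shapes from the question, wrapped in one root element.
SNIPPET = """
<div>
    <span class="company">
        <a>
              Micro Focus
        </a>
    </span>
    <span class="company">
        SSG <b>Software</b> Systems</span>
</div>
"""


def normalize_space(text):
    # Trim both ends and collapse every run of whitespace into one space,
    # which is the visible effect of XPath normalize-space().
    return " ".join(text.split())


def company_names(markup):
    root = ET.fromstring(markup)
    names = []
    for span in root.iter("span"):
        if span.get("class") == "company":
            # itertext() yields the text of the span and of all its
            # descendants in document order, so nested <a> and <b>
            # content is included instead of being split off.
            names.append(normalize_space("".join(span.itertext())))
    return names


print(company_names(SNIPPET))
```

The point is that text() only returns the text nodes that are direct children of the span, so a nested a or b element leaves you with whitespace-only fragments, while the string value of the whole element contains the full name.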

Scrapy: getting text that spans multiple lines and sits inside nested elements
