Question

I need an XPath for the following HTML:

  <div itemtype="http://schema.org/PostalAddress" itemscope="" itemprop="jobLocation">
  <div class="aiDetailJobInfoLabel aiDetailJobInfoLocation">Location: </div>
  <div class="aiDetailJobInfo aiDetailJobInfoLocation">
     <span itemprop="addressLocality">Topeka</span>
      , KS
      <span itemprop="postalCode">66607</span>
  </div>
</div>

From this HTML, I need the output to be Topeka, KS

It should not include 66607.

I tried this code, but it returns an empty result:

 >>> response.xpath('//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"][not(child::span[@itemprop="postalCode"])]//text()').extract()

If I write the code below instead, it gives

response.xpath('//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').extract()

output: Topeka, KS, 66607

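Please help me.

FYI: the XPath should take the div's text() and exclude the postal code, so that the remaining div and span text is returned. Sometimes postalCode is not present in this div. So if it is present, skip it; if not, return the whole text of the div.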

Answer 1

Here are a couple of options; use whichever you prefer.

Try one of these:

response.xpath('//div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').re(r'[ .a-zA-Z]\w+')



response.xpath('//div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').re(r'[a-zA-Z]+')


response.xpath('//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').extract()[1:3]
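To see what each of these variants returns, here is a rough sketch that emulates them with the standard `re` module. The `text_nodes` list is a hardcoded assumption of what `//text()` yields for the sample HTML, and `apply_regex` is a hypothetical helper that approximates `.re()` by running the pattern over each text node in turn.

```python
import re

# Text nodes that //text() is assumed to return for the sample HTML
# (hardcoded here so the sketch runs without Scrapy).
text_nodes = ["\n     ", "Topeka", "\n      , KS\n      ", "66607", "\n  "]


def apply_regex(pattern, nodes):
    """Collect every match of pattern across the nodes, in document order."""
    matches = []
    for node in nodes:
        matches.extend(re.findall(pattern, node))
    return matches


first = apply_regex(r"[ .a-zA-Z]\w+", text_nodes)
second = apply_regex(r"[a-zA-Z]+", text_nodes)
# The slice keeps nodes 1 and 2; strip() removes the surrounding whitespace.
sliced = [node.strip() for node in text_nodes[1:3]]

print(first)   # ['Topeka', ' KS']
print(second)  # ['Topeka', 'KS']
print(sliced)  # ['Topeka', ', KS']
```

Note that both regexes drop the comma, and the slice depends on the exact position of the whitespace-only text nodes, so all three are tied to this particular markup.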

Answer 2

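It looks like you essentially want to concatenate all text-node descendants of the target div, except those inside the postalCode span. The relevant set of text nodes would be found by an XPath like: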

//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]
   //text()[not(parent::span[@itemprop="postalCode"])]

If you call .extract() on this XPath, you get a list of strings (one per text node), which you can join together at the Python level.
