Distinguishing between an empty tag and a missing tag

Time: 2018-06-29 13:29:16

Tags: xpath scrapy

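While scraping, I need to detect when a tag is missing so that I know the page structure has changed. However, I get None whether the tag is missing or merely empty. How can I tell the two cases apart?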

Here is a minimal example:

from scrapy.http.response.text import TextResponse

normal = '<html><div id="brand">a</div></html>'
empty = '<html><div id="brand"></div></html>'
absent = '<html></html>'

res_normal = TextResponse(url='', encoding='utf-8', body=normal)
res_empty = TextResponse(url='', encoding='utf-8', body=empty)
res_absent = TextResponse(url='', encoding='utf-8', body=absent)

brand_normal = res_normal.xpath('//div[@id="brand"]/text()').extract_first()
brand_empty = res_empty.xpath('//div[@id="brand"]/text()').extract_first()
brand_absent = res_absent.xpath('//div[@id="brand"]/text()').extract_first()

print(brand_normal, brand_empty, brand_absent)

Current output:

a None None

Desired output:

a '' None

2 answers:

Answer 0: (score: 0)

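Query the div element first, then query its text() content relative to that first result; from there you can write whatever logic you need. If brand is empty (the selector matched nothing), the tag is missing and you do one thing; if len(brand_txt) >= 1, the div contains text and you do something else, and so on.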

from scrapy.http.response.text import TextResponse

normal = '<html><div id="brand">a</div></html>'

res_normal = TextResponse(url='', encoding='utf-8', body=normal)

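# Select the element itself first; an empty result here means the div is missing
brand = res_normal.xpath('//div[@id="brand"]')
# Then query the text nodes relative to that element
brand_txt = brand.xpath('./text()').extract()
if len(brand_txt) >= 1:
    print('div contains text')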

Answer 1: (score: 0)

Based on Luis Muñoz's answer, I wrote this small wrapper, which returns the expected values.

def text(node, is_attribute=False):
    val = ''
    if node.get():
        if is_attribute:
            parsed_val = node.extract_first()
        else:
            # parsed_val is None if the node is present but empty; that is the case we want to avoid
            parsed_val = node.xpath('./text()').extract_first()
        if parsed_val:
            val = parsed_val
    else:
        val = None
    return val