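When scraping, I need to detect when a tag is missing so that I know the page structure has changed. However, I get None whether the tag is missing or merely empty. How can I tell the two cases apart?
Here is a minimal example: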
from scrapy.http.response.text import TextResponse
normal = '<html><div id="brand">a</div></html>'
empty = '<html><div id="brand"></div></html>'
absent = '<html></html>'
res_normal = TextResponse(url='', encoding='utf-8', body=normal)
res_empty = TextResponse(url='', encoding='utf-8', body=empty)
res_absent = TextResponse(url='', encoding='utf-8', body=absent)
brand_normal = res_normal.xpath('//div[@id="brand"]/text()').extract_first()
brand_empty = res_empty.xpath('//div[@id="brand"]/text()').extract_first()
brand_absent = res_absent.xpath('//div[@id="brand"]/text()').extract_first()
print(brand_normal, brand_empty, brand_absent)
Current output:
a None None
Desired output:
a '' None
Answer 0 (score: 0)
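Query the div element first, then query its text() content relative to that first result. With both results in hand you can write whatever logic you need: if brand is empty (the div was not matched at all), do one thing; if len(brand_txt) >= 1, do something else; and so on.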
from scrapy.http.response.text import TextResponse
normal = '<html><div id="brand">a</div></html>'
res_normal = TextResponse(url='', encoding='utf-8', body=normal)
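# First query: does the div exist at all?
brand = res_normal.xpath('//div[@id="brand"]')
# Second query, relative to the first: the div's own text nodes
brand_txt = brand.xpath('./text()').extract()
if not brand:
    print('div is absent')
elif len(brand_txt) >= 1:
    print('div contains text')
else:
    print('div is present but empty')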
Answer 1 (score: 0)
Based on Luis Muñoz's answer, I wrote this handy wrapper, which returns the expected values.
def text(node, is_attribute=False):
    val = ''
    if node.get():
        if is_attribute:
            parsed_val = node.extract_first()
        else:
            # parsed_val is None if a node is empty but present,
            # that's what we want to avoid
            parsed_val = node.xpath('./text()').extract_first()
        if parsed_val:
            val = parsed_val
    else:
        val = None
    return val