The HTML looks like this:
<img alt="Papa's Cupcakeria To Go!" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-old-hires="" class="a-dynamic-image a-stretch-vertical" id="landingImage" data-a-dynamic-image="{&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L.png&quot;:[512,512],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SX425_.png&quot;:[425,425],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SX466_.png&quot;:[466,466],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SY450_.png&quot;:[450,450],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SY355_.png&quot;:[355,355]}" style="max-width:512px;max-height:512px;">
I want to get "https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L.png". Right now I am using
extract_item(hxs.xpath("//img[@id='landingImage']/@data-a-dynamic-image"))
but that returns the entire contents of the attribute. How can I get only the first URL?
Answer 0 (score: 0)
If you only want the first URL:
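full_content = extract_item(hxs.xpath("//img[@id='landingImage']/@data-a-dynamic-image"))
# The attribute holds a JSON object whose keys are the image URLs.
# Decode any leftover &quot; entities, then take the text between the first pair of double quotes.
first_image = full_content.replace("&quot;", '"').split('"')[1]
print(first_image)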
Alternatively, you can use a regular expression to extract the URL.
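Because the attribute value is a JSON object keyed by URL, parsing it may be more robust than splitting the string. The following is only a minimal sketch, assuming Python 3.7+ (where dicts keep insertion order) and that the value has the shape shown in the question. The names `first_image_url` and `sample` are hypothetical and introduced here for illustration; in the spider, the string returned by `extract_item(...)` would be passed in instead of `sample`.

```python
import html
import json


def first_image_url(attr_value):
    """Return the first URL key of a data-a-dynamic-image attribute value."""
    # Decode entities such as &quot; in case the value is still raw HTML.
    # For an already-decoded value like the one in the question this is a no-op.
    decoded = html.unescape(attr_value)
    # json.loads keeps the keys in document order (Python 3.7+).
    images = json.loads(decoded)
    # Iterating a dict yields its keys; the first key is the first URL.
    return next(iter(images))


# Hypothetical sample, shortened from the attribute shown in the question.
sample = (
    '{&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L.png&quot;:[512,512],'
    '&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SX425_.png&quot;:[425,425]}'
)
print(first_image_url(sample))
```

Since each key maps to a `[width, height]` pair, the same parsed dict could also be used to pick an image by size rather than by position, if that turns out to be what is actually needed.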