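I'm new to Python and web scraping. I wrote the two lines below to extract the title and price from a website. However, the output contains HTML tags and '\n' characters. How can I remove them and get only the text?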
product_name = response.css('#productTitle::text')[0].extract().strip('\n')
product_price = response.css('#priceblock_ourprice')[0].extract().strip()
Output:
[
" \n \n \n \n\n \n \n \n Stainless Steel Food Grinder Attachment fit KitchenAid Stand Mixers Including Sausage Stuffer, Dishwasher Safe,Durable Mixer Accessories as Meat Processor\n \n \n\n \n \n \n \n ",
"<span id=\"priceblock_ourprice\" class=\"a-size-medium a-color-price priceBlockBuyingPriceString\">$87.99</span>"
]
Answer 0 (score: 1)
To remove the \n characters and the extra spaces:
for text in str_list:  # str_list: the list of extracted strings
    text = text.replace("\n", "")  # remove every '\n' from the text
    while "  " in text:  # while the string still contains two consecutive spaces
        text = text.replace("  ", " ")  # collapse them into one; repeat until none remain
    print(text.strip())  # strip() drops the single leading/trailing space that is left
The second selector should also include ::text, so that it returns the element's text instead of its HTML:
product_price = response.css('#priceblock_ourprice::text').extract_first()
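As an alternative to the replace loop, the whitespace cleanup can be sketched with str.split() and str.join(), which collapse every run of spaces and newlines in one step. The helper name clean_text and the sample strings below are assumptions made for illustration (the title is shortened from the output in the question); in a spider the inputs would come from the ::text selectors. The None check is there because extract_first() returns None when the selector matches nothing.

```python
def clean_text(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces.

    Hypothetical helper, not part of Scrapy.
    """
    if raw is None:  # extract_first() returns None when nothing matched
        return ""
    # split() with no arguments splits on any whitespace run and ignores
    # leading/trailing whitespace, so re-joining with " " normalizes the text.
    return " ".join(raw.split())


# Sample values shaped like the output in the question (title shortened).
raw_title = " \n \n \n Stainless Steel Food Grinder Attachment\n \n \n "
raw_price = "$87.99"

product_name = clean_text(raw_title)
product_price = clean_text(raw_price)

print(product_name)
print(product_price)
```

Unlike replacing "\n" with an empty string, this approach cannot glue two words together when a newline is the only thing separating them.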