Why can't the Scrapy Selector parse all the tags?

Asked: 2016-02-06 23:20:02

Tags: python parsing scrapy

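Hello, and thanks for looking at my question.

The problem I am running into is that the Scrapy selector does not seem to parse the site's tags correctly.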

    pp re.findall("meta.*",response.body)
['meta name="verify-v1" content="C4vnWz0WNdkra4aXTdZ9iegoTDhnazsNf0RVwXaT9eM=">\r',
 'meta http-equiv="content-type" content="text/html" charset="utf-8" />\r',
 'meta http-equiv="X-UA-Compatible" content="IE=edge" />\r',
 'meta name="WT.cg_n" content="Part Search" />\r',
 'meta name="WT.cg_s" content="Part Detail" />\r',
 'meta name="WT.ti" content="Part Detail" />\r',
 'meta name="WT.z_page_type" content="PS" />\r',
 'meta name="WT.z_page_sub_type" content="PD" />\r',
 'meta name="WT.z_page_id" content="PD" />\r',
 'meta name="WT.pn_sku" content=481-2"X36YD-ND />\r',
 'meta name="WT.z_part_id" content=1819153 />\r',
 'meta name="WT.tx_e" content="v" />\r',
 'meta name="WT.tx_u" content="1" />\r',
 'meta name="WT.z_supplier_id" content=19 />\r',
 'meta itemprop="productID" content="sku:481-2"X36YD-ND" />\r',
 'meta itemprop="name" content="481-2"X36YD" />\r']
ipdb> pp response.xpath("//meta")
[<Selector xpath='//meta' data=u'<meta name="verify-v1" content="C4vnWz0W'>,
 <Selector xpath='//meta' data=u'<meta http-equiv="content-type" content='>,
 <Selector xpath='//meta' data=u'<meta http-equiv="X-UA-Compatible" conte'>,
 <Selector xpath='//meta' data=u'<meta name="description" content=\'Find 3'>]
ipdb>

I can't figure out why this happens. Why are the other tags not parsed even though they are present on the page?

Thanks.

1 answer:

Answer 0 (score: 0)

I found that BeautifulSoup with the built-in html.parser handles this particular markup better:

$ scrapy shell https://www.digikey.com/product-detail/en/481-2%22X36YD/481-2%22X36YD-ND/1819153
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(response.body, "html.parser")
>>>
>>> from pprint import pprint
>>> pprint([meta["content"] for meta in soup.find_all("meta")])
[u'C4vnWz0WNdkra4aXTdZ9iegoTDhnazsNf0RVwXaT9eM=',
 u'text/html',
 u'IE=edge',
 u'Find 3M 481-2"\x00X36YD (481-2"\x00X36YD-ND) at DigiKey.  Check stock and pricing, view product specifications, and order online.',
 u'481-2"\x00X36YD, 3M, Tape',
 u'Digi-Key Search Engine',
 u'Part Search',
 u'Part Detail',
 u'Part Detail',
 u'PS',
 u'PD',
 u'PD',
 u'481-2"X36YD-ND',
 u'1819153',
 u'v',
 u'1',
 u'19',
 u'sku:481-2"X36YD-ND',
 u'481-2"X36YD']
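BeautifulSoup's "html.parser" backend builds on the html.parser module from the Python standard library. As a rough, dependency-free illustration of how lenient that parser is, the sketch below feeds three of the meta tags from the question (two of them with unquoted content values) to HTMLParser directly. The class name MetaContentCollector is made up for this example, and the sketch only shows the parser's tolerance; it does not by itself explain why Scrapy's selector stopped early.

```python
from html.parser import HTMLParser


class MetaContentCollector(HTMLParser):
    """Collect the content attribute of every <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags
        # such as <meta ... /> are routed through this method as well.
        if tag == "meta":
            content = dict(attrs).get("content")
            if content is not None:
                self.contents.append(content)


# Three tags copied from the question's re.findall output,
# two of them with unquoted attribute values.
html = (
    '<meta name="WT.z_page_id" content="PD" />\n'
    '<meta name="WT.pn_sku" content=481-2"X36YD-ND />\n'
    '<meta name="WT.z_part_id" content=1819153 />\n'
)

collector = MetaContentCollector()
collector.feed(html)
collector.close()
print(collector.contents)
```

All three content values are collected, including the unquoted ones.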

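What you can actually do in your Scrapy project is pipe response.body through BeautifulSoup's HTML parser in a middleware, essentially "fixing" the broken HTML with BeautifulSoup. This requires no changes to the spiders you already have. Here is an example middleware implementation: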
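A minimal sketch of such a downloader middleware follows, assuming Python 3 and the bs4 package. The class name BeautifulSoupMiddleware and the BEAUTIFULSOUP_PARSER setting are names invented for this sketch, not part of Scrapy. To keep the example free of Scrapy imports it uses a duck-typed check on response.encoding; in a real project, isinstance(response, scrapy.http.HtmlResponse) would be the stricter test.

```python
from bs4 import BeautifulSoup


class BeautifulSoupMiddleware:
    """Re-serialize text response bodies through BeautifulSoup (illustrative sketch)."""

    def __init__(self, parser="html.parser"):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # BEAUTIFULSOUP_PARSER is a hypothetical setting name for this sketch.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Only text responses expose .encoding; anything else passes through untouched.
        encoding = getattr(response, "encoding", None)
        if encoding is None:
            return response
        soup = BeautifulSoup(response.body, self.parser, from_encoding=encoding)
        # soup.encode() returns bytes, which is what a response body must be.
        return response.replace(body=soup.encode(encoding))
```

To enable it, the class would be listed in the DOWNLOADER_MIDDLEWARES setting in settings.py under its import path, for example myproject.middlewares.BeautifulSoupMiddleware (a hypothetical path). Spiders then receive the re-serialized HTML, so response.xpath("//meta") runs against the markup BeautifulSoup produced.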