I have this rule:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r'storeId='),
        callback="parse_item"
    ),
)
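The page has 16 links, but this rule finds only 13. If I save the page locally and run the extractor on the saved copy, it finds all 16.
This is driving me crazy. What is wrong with this page?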
Answer 0 (score: 0)
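You can use a different link extractor, such as RegexLinkExtractor, instead of SgmlLinkExtractor. In the Scrapy shell session below, RegexLinkExtractor finds all 16 links on the live page, while SgmlLinkExtractor finds only 13: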
paul@machine:~$ scrapy shell "http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results"
...
2014-06-11 15:49:43+0000 [default] INFO: Spider opened
2014-06-11 15:49:43+0000 [default] DEBUG: Crawled (200) <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x2961210>
[s] item {}
[s] request <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results>
[s] response <200 http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x2fab5d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor
In [2]: lx = RegexLinkExtractor(allow=r'storeId=')
In [3]: lx.extract_links(response)
Out[3]:
[Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1626', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=3183', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1632', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1627', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1628', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1642', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1641', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1623', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1634', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1625', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1630', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=2176', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1619', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1622', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1599', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1636', text=u'VIEW STORE DETAILS', fragment='', nofollow=False)]
In [4]: len(lx.extract_links(response))
Out[4]: 16
In [5]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [6]: lx = SgmlLinkExtractor(allow=r'storeId=')
In [7]: len(lx.extract_links(response))
Out[7]: 13
In [8]:
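
To illustrate why a regex-based extractor can be more forgiving, here is a minimal standard-library sketch of the general idea. It is not Scrapy's actual RegexLinkExtractor implementation: the pattern `HREF_RE`, the helper `extract_links`, and the sample HTML (including its deliberately broken tag) are all made up for illustration, and the thread does not establish what is actually wrong with the live page.

```python
import re
from urllib.parse import urljoin

# Hypothetical, deliberately loose pattern: match any quoted href value,
# without requiring the surrounding markup to be well formed.
HREF_RE = re.compile(r'''href\s*=\s*(["'])(.*?)\1''', re.IGNORECASE | re.DOTALL)


def extract_links(html, base_url, allow):
    """Return absolute, de-duplicated URLs whose text matches `allow`."""
    allow_re = re.compile(allow)
    seen = set()
    links = []
    for _quote, href in HREF_RE.findall(html):
        url = urljoin(base_url, href.strip())
        if allow_re.search(url) and url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Made-up sample: the second anchor sits inside an unclosed <td tag,
# the kind of markup a stricter parser might mishandle.
sample = '''
<a href="/custserv/store_details.jsp?storeId=1626">VIEW STORE DETAILS</a>
<td <a href='/custserv/store_details.jsp?storeId=3183'>VIEW STORE DETAILS</a>
<a href="/custserv/contact.jsp">CONTACT</a>
'''

links = extract_links(sample, "http://www.stevemadden.com/", r"storeId=")
print(len(links))
```

Because the pattern only looks for `href` attributes, it picks up both store links even though the second one is wrapped in broken markup. The trade-off is that a regex like this also matches `href` values in places that are not real links (comments, scripts), so it is best treated as a fallback rather than a replacement for a proper parser.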