Scrapy link extractor not working

Time: 2014-06-11 06:36:44

Tags: scrapy

I am trying to crawl this page: http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results

I have this rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow=r'storeId='),
        callback="parse_item"
    ),
)

The page contains 16 links, but this rule finds only 13. If I save the page locally and run the extractor on the saved copy, it finds all 16.

This is driving me crazy. What is wrong with this page?

1 answer:

Answer 0 (score: 0)

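You can use a different link extractor, such as RegexLinkExtractor, instead of SgmlLinkExtractor. In the shell session below, RegexLinkExtractor finds all 16 links on this page, while SgmlLinkExtractor finds only 13.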
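To give a rough idea of what regex-based link extraction means, here is a minimal standard-library sketch in Python 3. It is not Scrapy's implementation: the `HREF_RE` pattern, the `extract_links` helper, and the sample markup are all assumptions made up for illustration. It scans the raw text for `href` attributes, resolves them against the page URL, and keeps those matching the `allow` pattern, so it does not depend on the markup parsing cleanly.

```python
import re
from urllib.parse import urljoin

# Simplified pattern for href attributes inside <a> tags.
# This is an illustrative assumption, not the pattern Scrapy uses internally.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url, allow):
    """Return absolute, de-duplicated URLs from html that match the allow regex."""
    allow_re = re.compile(allow)
    links = []
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if allow_re.search(url) and url not in links:
            links.append(url)
    return links


# Hypothetical markup; the second anchor is deliberately sloppy (never closed).
sample_html = """
<a href="/custserv/store_details.jsp?storeId=1626">VIEW STORE DETAILS</a>
<td><a class=x href='/custserv/store_details.jsp?storeId=3183'>VIEW STORE DETAILS</td>
<a href="/custserv/contact.jsp">CONTACT</a>
"""
base = "http://www.stevemadden.com/custserv/locate_store.cmd"
store_links = extract_links(sample_html, base, r"storeId=")
print(store_links)
```

A text scan like this tolerates broken markup, but it can also pick up links inside comments or scripts, so it is a trade-off rather than a strictly better extractor.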

paul@machine:~$ scrapy shell "http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results"
...
2014-06-11 15:49:43+0000 [default] INFO: Spider opened
2014-06-11 15:49:43+0000 [default] DEBUG: Crawled (200) <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x2961210>
[s]   item       {}
[s]   request    <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results>
[s]   response   <200 http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x2fab5d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor

In [2]: lx = RegexLinkExtractor(allow=r'storeId=')

In [3]: lx.extract_links(response)
Out[3]: 
[Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1626', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=3183', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1632', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1627', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1628', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1642', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1641', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1623', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1634', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1625', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1630', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=2176', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1619', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1622', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1599', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1636', text=u'VIEW STORE DETAILS', fragment='', nofollow=False)]

In [4]: len(lx.extract_links(response))
Out[4]: 16

In [5]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [6]: lx = SgmlLinkExtractor(allow=r'storeId=')

In [7]: len(lx.extract_links(response))
Out[7]: 13

In [8]:
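The session above shows the counts (16 versus 13) but not which three links SgmlLinkExtractor skips. A small helper along the following lines could list them. This is only a sketch: `missing_urls` is a hypothetical name, the `Link` namedtuple is a stand-in for Scrapy's Link objects (only the `url` attribute seen in the output above is used), and the sample lists are made-up data, not the actual missing links.

```python
from collections import namedtuple

# Stand-in for Scrapy's Link objects; only the .url attribute is relied on.
Link = namedtuple("Link", ["url", "text"])


def missing_urls(reference_links, other_links):
    """Return URLs present in reference_links but absent from other_links, in order."""
    seen = {link.url for link in other_links}
    return [link.url for link in reference_links if link.url not in seen]


# Hypothetical data standing in for the two extract_links(response) results.
BASE = "http://www.stevemadden.com/custserv/store_details.jsp?storeId="
regex_links = [
    Link(BASE + "1626", "VIEW STORE DETAILS"),
    Link(BASE + "3183", "VIEW STORE DETAILS"),
    Link(BASE + "1632", "VIEW STORE DETAILS"),
]
sgml_links = [regex_links[0], regex_links[2]]  # pretend one link was skipped

skipped = missing_urls(regex_links, sgml_links)
print(skipped)
```

In the Scrapy shell, the same function could be called with the two real results, for example `missing_urls(regex_lx.extract_links(response), sgml_lx.extract_links(response))`, where `regex_lx` and `sgml_lx` are the two extractors created in the session. Looking at the markup around the skipped links in the page source may then show why the SGML-based extractor misses them.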