I'm having trouble creating rules.
Suppose my start URL is http://www.example.com/search?q=news
When I open this URL in a web browser, I get the following source code:
<html><head>...</head><body>
<ul id="results-list">
<li class="result clearfix news">
<div class="summary">
<h3><a href="/sports/hockey/struggling-canucks-rely-on-schneider-to-snag-win-against-sens/article2243069/">Struggling Canucks rely on Schneider to snag win against Sens</a></h3>
<p class="summary">Nov 21, 2011– Eleventh place Canucks rely on goalie Cory Schneider to improve record to 10-9-1
</p>
<p class="meta"><a href="/sports/hockey/struggling-canucks-rely-on-schneider-to-snag-win-against-sens/article2243069/">http://www.example.com/sports/hockey/struggling-canucks-rely-on-schneider-to-snag-win-against-sens/article2243069/</a>
</p>
</div>
</li>
<li class="result clearfix news">
<div class="summary">
<h3><a href="/news/world/celebrities-set-to-testify-at-uk-media-ethics-inquiry/article2242840/">Celebrities set to testify at U.K. media ethics inquiry</a></h3>
<p class="summary">Nov 20, 2011– Hugh Grant and J.K. Rowling given opportunity to strike back against tabloids’ invasion of privacy
</p>
<p class="meta"><a href="/news/world/celebrities-set-to-testify-at-uk-media-ethics-inquiry/article2242840/">http://www.example.com/news/world/celebrities-set-to-testify-at-uk-media-ethics-inquiry/article2242840/</a>
</p>
</div>
</li>
...
</ul><!-- end of ul#results-list -->
<ul class="paginator">
<li class="selected"><a href="http://www.example.com/search/?q=news&start=0">1</a></li>
<li ><a href="http://www.example.com/search/?q=news&start=10">2</a></li>
<li ><a href="http://www.example.com/search/?q=news&start=20">3</a></li>
...
<li class="jump last"><a href="http://www.example.com/search/?q=news&start=90">Last</a></li>
</ul><!-- end of ul.paginator -->
</body></html>
Now I want to extract data from the links that appear inside ul#results-list, e.g. http://www.example.com/sports/hockey/struggling-canucks-rely-on-schneider-to-snag-win-against-sens/article2243069/ and so on...
I created a spider for this, as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from thirdapp.items import ThirdappItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/search?q=news',
        'http://www.example.com/search?q=movies',
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow('?q=news',), restrict_xpaths('ul[@class="paginator"]',)), callback='parse_item', allow=True),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s', response.url)
        hxs = HtmlXPathSelector(response)
        #item = ThirdappItem()
        items = hxs.select('//h3')
        scraped_items = []
        for item in items:
            scraped_item = ThirdappItem()
            scraped_item["title"] = item.select('a/text()').extract()
            scraped_items.append(scraped_item)
        return items

spider = MySpider()
So what should the rule be in order to get the result I'm expecting?
Answer 0 (score: 1)
First of all, what result are you expecting? Second, maybe your rule should address the links themselves: not just the ul container that holds the list-item nodes, but the desired link nodes!?
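For illustration, here is a sketch of how the rules tuple might target both the article links and the pagination, as this answer suggests. The XPaths and the second rule are assumptions based on the HTML shown above, not part of the original answer; the tuple would sit inside the question's CrawlSpider:

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

rules = (
    # Pull the article links out of the results list itself and hand
    # each article page to parse_item.
    Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@id="results-list"]',)),
         callback='parse_item'),
    # Separately follow the pagination links so every results page is visited.
    Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="paginator"]',)),
         follow=True),
)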
Answer 1 (score: 0)
According to the documentation, SgmlLinkExtractor's allow parameter is: a single regular expression (or a list of regular expressions) that the (absolute) URLs must match in order to be extracted. So the allow argument should look like:
allow=('.*\?q=news.*',)
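As a quick sanity check with plain re, outside of Scrapy (the two URLs are taken from the page source above):

import re

pattern = re.compile(r'.*\?q=news.*')
# A paginator URL contains the literal '?q=news', so it matches:
print(pattern.match('http://www.example.com/search/?q=news&start=10') is not None)  # True
# An article URL has no '?q=news' query string, so it does not:
print(pattern.match('http://www.example.com/news/world/celebrities-set-to-testify-at-uk-media-ethics-inquiry/article2242840/') is not None)  # False
# The backslash matters: without it, '?' acts as a regex quantifier
# (making '.*?' a non-greedy '.*') instead of matching a literal '?'.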
And most likely, the last argument of your Rule should not be allow=True but follow=True.
The final rule (note the escaped question mark, and the leading // so that restrict_xpaths finds the paginator anywhere in the document):
Rule(SgmlLinkExtractor(allow=('.*\?q=news.*',), restrict_xpaths=('//ul[@class="paginator"]',)), callback='parse_item', follow=True)
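Putting it together, a minimal sketch of the whole spider under that rule (one start URL kept for brevity). This mirrors the question's parse_item, but narrows the //h3 selector to the results list and returns the populated items rather than the raw selectors; both of those adjustments are my assumptions, not part of the original post:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from thirdapp.items import ThirdappItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/search?q=news']

    rules = (
        # Follow every pagination link and parse each results page.
        Rule(SgmlLinkExtractor(allow=(r'.*\?q=news.*',),
                               restrict_xpaths=('//ul[@class="paginator"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        scraped_items = []
        # Collect the title of each result on this page.
        for node in hxs.select('//ul[@id="results-list"]//h3'):
            item = ThirdappItem()
            item['title'] = node.select('a/text()').extract()
            scraped_items.append(item)
        # Return the items, not the selector list.
        return scraped_items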