I have a seed URL (say DOMAIN/manufacturers.php) with no pagination, which looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="st-text">
<table cellspacing="6" width="600">
<tr>
<td>
<a href="manufacturer1-type-59.php"></a>
</td>
<td>
<a href="manufacturer1-type-59.php">Name 1</a>
</td>
<td>
<a href="manufacturer2-type-5.php"></a>
</td>
<td>
<a href="manufacturer2-type-5.php">Name 2</a>
</td>
</tr>
<tr>
<td>
<a href="manufacturer3-type-88.php"></a>
</td>
<td>
<a href="manufacturer3-type-88.php">Name 3</a>
</td>
<td>
<a href="manufacturer4-type-76.php"></a>
</td>
<td>
<a href="manufacturer4-type-76.php">Name 4</a>
</td>
</tr>
<tr>
<td>
<a href="manufacturer5-type-28.php"></a>
</td>
<td>
<a href="manufacturer5-type-28.php">Name 5</a>
</td>
<td>
<a href="manufacturer6-type-48.php"></a>
</td>
<td>
<a href="manufacturer6-type-48.php">Name 6</a>
</td>
</tr>
</table>
</div>
</body>
</html>
From there I want to get all of the a['href']s, e.g. manufacturer1-type-59.php. Note that these links do not contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?
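One way to add the prefix back is to resolve each relative href against the page it was found on; a minimal sketch using the standard library, assuming a hypothetical base URL in place of DOMAIN:

```python
from urllib.parse import urljoin

# hypothetical base URL standing in for DOMAIN
base = "http://www.example.com/manufacturers.php"

# resolve a relative href against the page it came from
full = urljoin(base, "manufacturer1-type-59.php")
print(full)  # http://www.example.com/manufacturer1-type-59.php
```

Scrapy responses also expose `response.urljoin(href)`, which performs the same resolution relative to the current page.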
Optionally, I would also like to keep these links in memory (for the next stage) and save them to disk for future reference.
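The keep-in-memory-and-persist idea can be sketched very simply: collect the hrefs in a list for the next stage and write one per line (the file name and the hrefs below are placeholders):

```python
# hypothetical hrefs extracted from the seed page
links = [
    "manufacturer1-type-59.php",
    "manufacturer2-type-5.php",
]

# `links` stays in memory for the next stage; a copy goes to disk
with open("manufacturer_links.txt", "w") as f:
    for link in links:
        f.write(link + "\n")
```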
The content of each link (e.g. manufacturer1-type-59.php) looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="makers">
<ul>
<li>
<a href="manufacturer1_model1_type1.php"></a>
</li>
<li>
<a href="manufacturer1_model1_type2.php"></a>
</li>
<li>
<a href="manufacturer1_model2_type3.php"></a>
</li>
</ul>
</div>
<div class="nav-band">
<div class="nav-items">
<div class="nav-pages">
<span>Pages:</span><strong>1</strong>
<a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
<a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
<a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
</div>
</div>
</div>
</body>
</html>
Next, I want to get all of the a['href']s, e.g. manufacturer_model1_type1.php. Again, note that these links do not contain the domain prefix. An additional difficulty here is that these pages support pagination, so I also want to visit all of those pages. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.
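Finding the "Next page" link can be done with the standard-library HTML parser alone; a sketch run against the nav markup above (a crawler would keep following next_href until it comes back None):

```python
from html.parser import HTMLParser

class NextPageFinder(HTMLParser):
    """Collects the href of the <a> whose title attribute is 'Next page'."""
    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("title") == "Next page":
            self.next_href = attrs.get("href")

html = ('<div class="nav-pages"><span>Pages:</span><strong>1</strong>'
        '<a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>'
        '<a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a></div>')

finder = NextPageFinder()
finder.feed(html)
print(finder.next_href)  # manufacturer1-type-STRING-59-INT-p2.php
```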
Optionally, I would again like to keep these links in memory (for the next stage) and save them to disk for future reference.
The third and final step should be to retrieve the content of all pages of type manufacturer_model1_type1.php, extract the title, and save the result to a file in the format (url, title).
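For that last step, here is a sketch of extracting the <title> text and writing (url, title) rows with the csv module; the sample page, URL, and output file name are all hypothetical:

```python
import csv
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Captures the text inside the <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# hypothetical page content for one phone page
page = "<html><head><title>Model 1 specs</title></head><body></body></html>"
parser = TitleParser()
parser.feed(page)

# one (url, title) row per crawled page
rows = [("manufacturer1_model1_type1.php", parser.title)]
with open("titles.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```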
EDIT

Here is what I have done so far, but it doesn't seem to work...
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ArchiveItem(scrapy.Item):
    url = scrapy.Field()


class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']

    rules = [
        Rule(LinkExtractor(allow=[r'\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+_\S+_\S+-\d+\.php']), 'parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent
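One way to debug rules like these is to test the allow patterns offline against known URLs before running the crawl (LinkExtractor applies them as regex searches). A quick sketch; the sample URLs are made up to match the site's apparent naming scheme:

```python
import re

# the allow patterns from the rules above
patterns = [
    r'\S+-phones-\d+\.php',
    r'\S+-phones-f-\d+-0-\S+\.php',
    r'\S+_\S+_\S+-\d+\.php',
]

# hypothetical URLs in the shapes the site seems to use
samples = [
    'samsung-phones-9.php',
    'samsung-phones-f-9-0-p2.php',
    'samsung_galaxy_s5-6033.php',
]

for pattern, sample in zip(patterns, samples):
    print(pattern, '->', bool(re.search(pattern, sample)))
```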
Answer 0 (score: 2)
I think you would be better off using BaseSpider instead of CrawlSpider. This code might help:
from scrapy import Spider, Request


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for marker in markers:
            yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        # extracting phone urls
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        if not phones:
            return
        for phone in phones:
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)
        # pagination
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # extract whatever you want and yield items here
        pass
EDIT
If you want to keep track of where these phone URLs came from, you can pass meta from parse through parse_marker to parse_phone. The requests would then look like:
yield Request(url=self.BASE_URL + marker, callback=self.parse_marker,
              meta={'url_level1': response.url})

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone,
              meta={'url_level2': response.url,
                    'url_level1': response.meta['url_level1']})
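In parse_phone, the carried-over values can then be read back from response.meta. A plain-dict sketch of how the values chain through the three callbacks (no network involved; the marker URL is hypothetical):

```python
# simulate how meta flows: parse -> parse_marker -> parse_phone
seed_url = "http://www.gsmarena.com/makers.php3"

# parse attaches the seed URL for parse_marker
meta_level1 = {'url_level1': seed_url}

# parse_marker adds its own URL and forwards the seed URL for parse_phone
marker_url = "http://www.gsmarena.com/manufacturer1-type-59.php"  # hypothetical
meta_level2 = {'url_level2': marker_url,
               'url_level1': meta_level1['url_level1']}

print(meta_level2['url_level1'])  # http://www.gsmarena.com/makers.php3
```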