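Good evening, and thanks in advance.
I'm digging into Scrapy. My need is to fetch information from a website and recreate the same tree structure as the site. For example: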
books [
    python [
        first [
            title = 'Title'
            author = 'John Doe'
            price = '200'
        ]
        first [
            title = 'Other Title'
            author = 'Mary Doe'
            price = '100'
        ]
    ]
    php [
        first [
            title = 'PhpTitle'
            author = 'John Smith'
            price = '100'
        ]
        first [
            title = 'Php Other Title'
            author = 'Mary Smith'
            price = '300'
        ]
    ]
]
Following the tutorial, I got my basic spider working:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from pippo.items import PippoItem


class PippoSpider(BaseSpider):
    name = "pippo"
    allowed_domains = ["www.books.net"]
    start_urls = [
        "http://www.books.net/index.php"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="28008_LeftPane"]/div/ul/li')
        items = []
        for site in sites:
            item = PippoItem()
            item['subject'] = site.select('a/b/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items
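My problem is that each level of my structure sits one level deeper in the site. At the base level I get the book subjects, and then I need to crawl the corresponding item['link'] to get the other items. But at the next URL I need a different HtmlXPathSelector to extract my data correctly, and so on until the end of the structure.
Could you please help me and point me in the right direction? Thanks.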
Answer 0 (score: 1)
You need to create the Requests for the links manually (see also CrawlSpider):
from urlparse import urljoin

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from pippo.items import PippoItem


class PippoSpider(BaseSpider):
    name = "pippo"
    allowed_domains = ["www.books.net"]
    start_urls = ["http://www.books.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="28008_LeftPane"]/div/ul/li')
        for site in sites:
            item = PippoItem()
            item['subject'] = site.select('.//text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            link = item['link'][0] if len(item['link']) else None
            if link:
                # Follow the link; the partially filled item travels in meta.
                yield Request(urljoin(response.url, link),
                    callback=self.parse_link,
                    # Bind item as a default argument so each errback
                    # returns its own item, not the last one in the loop.
                    errback=lambda _, item=item: item,
                    meta=dict(item=item),
                )
            else:
                yield item

    def parse_link(self, response):
        # The deeper page can use its own selector here.
        item = response.meta.get('item')
        item['alsothis'] = 'more data'
        return item
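The spider above still yields flat items, so the nesting from the question has to be rebuilt somewhere, for example once the crawl is done. Below is a minimal sketch of that last step using only the standard library. It assumes Python 3 (where urljoin lives in urllib.parse instead of urlparse) and that each finished item carries its subject as a plain string. The names build_tree and flat_items and the path python/list.php are made up for illustration and are not part of Scrapy or of the site.

```python
from collections import defaultdict
from urllib.parse import urljoin  # Python 3 location of urljoin


def build_tree(items):
    """Group flat item dicts into {"books": {subject: [book, ...]}}."""
    by_subject = defaultdict(list)
    for item in items:
        # Everything except the grouping key becomes the book record.
        book = {key: value for key, value in item.items() if key != "subject"}
        by_subject[item["subject"]].append(book)
    return {"books": dict(by_subject)}


# Hypothetical flat items, shaped like what the last callback might return.
flat_items = [
    {"subject": "python", "title": "Title", "author": "John Doe", "price": "200"},
    {"subject": "python", "title": "Other Title", "author": "Mary Doe", "price": "100"},
    {"subject": "php", "title": "PhpTitle", "author": "John Smith", "price": "100"},
]

tree = build_tree(flat_items)

# Relative links are resolved against the page they were found on,
# which is what urljoin(response.url, link) does in the spider.
absolute = urljoin("http://www.books.net/index.php", "python/list.php")
```

Grouping after the crawl keeps each callback simple: every level only adds its own fields to the item carried in meta, and the tree shape is decided in one place.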