I am scraping this site using Scrapy, but I have run into an XPath problem and I am not entirely sure what is going on.
Why does this work:
def parse_item(self, response):
    item = BotItem()
    for title in response.xpath('//h1'):
        item['title'] = title.xpath('strong/text()').extract()
        item['wage'] = title.xpath('span[@class="price"]/text()').extract()
        yield item
but the following does not?
def parse_item(self, response):
    item = BotItem()
    for title in response.xpath('//body'):
        item['title'] = title.xpath('h1/strong/text()').extract()
        item['wage'] = title.xpath('h1/span[@class="price"]/text()').extract()
        yield item
My goal is also to extract this XPath:

//div[@id="description"]/p

but I cannot, because it sits outside the h1 node. How can I do that? My full code is:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bot.items import BotItem


class MufmufSpider(CrawlSpider):
    name = 'mufmuf'
    allowed_domains = ['mufmuf.ro']
    start_urls = ['http://mufmuf.ro/locuri-de-munca/joburi-in-strainatate/']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="paginate"][position() = last()]'),
            #callback='parse_start_url',
            follow=True
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//h3/a'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        item = BotItem()
        for title in response.xpath('//h1'):
            item['title'] = title.xpath('strong/text()').extract()
            item['wage'] = title.xpath('span[@class="price"]/text()').extract()
            #item['description'] = title.xpath('div[@id="descirption"]/p/text()').extract()
            yield item
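(For completeness, BotItem is imported from bot.items; a minimal item definition consistent with the fields used above would look like the sketch below. The actual bot/items.py in the project may declare more fields.)

import scrapy


class BotItem(scrapy.Item):
    # Fields referenced by the spider; the real bot/items.py may define more.
    title = scrapy.Field()
    wage = scrapy.Field()
    description = scrapy.Field()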
Answer 0 (score: 4)
The

for title in response.xpath('//body'):

option does not work because the XPath expressions inside the loop make it search for the h1 element directly inside the body element (i.e. as an immediate child), which is apparently not where it sits on this page.
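A quick way to see the difference between a direct-child path and a descendant path is to run both against a small HTML fragment with scrapy.Selector (a minimal sketch; the markup below is made up for illustration):

from scrapy import Selector

# Made-up fragment: the <h1> is nested inside a wrapper <div>,
# so it is NOT a direct child of <body>.
html = '<body><div class="wrap"><h1><strong>Cook</strong></h1></div></body>'
body = Selector(text=html).xpath('//body')[0]

print(body.xpath('h1/strong/text()').extract())     # [] - 'h1' only matches a direct child of body
print(body.xpath('.//h1/strong/text()').extract())  # ['Cook'] - './/h1' matches any descendant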
Besides, since only a single entity needs to be extracted here, there is no need for a loop at all:
def parse_item(self, response):
    item = BotItem()
    item["title"] = response.xpath('//h1/strong/text()').extract()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract()
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()
    return item
(This should also answer the second question, about description.)
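As a side note, if title and wage are expected to be single strings rather than one-element lists, extract_first() (or .get() in newer Scrapy versions) can be used instead of extract(); a minimal variation of the same parse_item:

def parse_item(self, response):
    item = BotItem()
    # extract_first() returns the first matching string (or None) instead of a list
    item["title"] = response.xpath('//h1/strong/text()').extract_first()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract_first()
    # the description may span several <p> elements, so keep extract() here
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()
    return item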