I want Scrapy to iterate over each post individually so that the related data stays grouped together. Right now it lumps all the links, titles, dates, and so on together, and it also writes everything to the output file more than once. I'm new to both Scrapy and Python, so any advice would be appreciated.
Here is my spider code:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from fashioBlog.functions import extract_data
from fashioBlog.items import Fashioblog

class firstSpider(Spider):
    name = "first"
    allowed_domains = [
        "stopitrightnow.com"
    ]
    start_urls = [
        "http://www.stopitrightnow.com"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="post-outer"]')
        items = []
        for site in sites:
            item = Fashioblog()
            item['title'] = extract_data(site.xpath('//h3[normalize-space(@class)="post-title entry-title"]//text()').extract())
            item['url'] = extract_data(site.xpath('//div[normalize-space(@class)="post-body entry-content"]//@href').extract())
            item['date'] = extract_data(site.xpath('//h2[normalize-space(@class)="date-header"]/span/text()').extract())
            #item['body'] = site.xpath('//div[@class="post-body entry-content"]/i/text()').extract()
            item['labelLink'] = extract_data(site.xpath('//span[normalize-space(@class)="post-labels"]//@href').extract())
            item['comment'] = extract_data(site.xpath('//span[normalize-space(@class)="post-comment-link"]//text()').extract())
            item['picUrl'] = extract_data(site.xpath('//div[normalize-space(@class)="separator"]//@href').extract())
            #item['labelText'] = extract_data(site.xpath('//i//text()').extract())
            #item['labelLink2'] = extract_data(site.xpath('//i//@href').extract())
            yield item
Answer 0 (score: 2):
Make the XPath expressions context-specific by prepending a dot:
item['title'] = extract_data(site.xpath('.//h3[normalize-space(@class)="post-title entry-title"]//text()').extract())
^ HERE
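
Without the leading dot, an XPath that starts with // searches the whole document even when it is called on a sub-selector, which is why every item ends up with data from every post. A minimal sketch of the difference, using lxml (the library Scrapy's selectors are built on) with an invented two-post HTML snippet for illustration:

```python
# Demonstrates why the leading dot matters: an absolute XPath run on a
# sub-node still searches the entire document, while a dot-prefixed
# (relative) XPath stays inside that node.
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="post-outer"><h3>first</h3></div>
  <div class="post-outer"><h3>second</h3></div>
</body></html>
""")

posts = doc.xpath('//div[@class="post-outer"]')

# Absolute path: matches every h3 in the document, even from one post.
print(posts[0].xpath('//h3/text()'))   # ['first', 'second']

# Relative path (leading dot): matches only within this post.
print(posts[0].xpath('.//h3/text()'))  # ['first']
```

Applying the same change to every site.xpath(...) call in the loop (//h3[...] becomes .//h3[...], and so on) keeps each item's title, url, date, and the other fields scoped to its own post-outer div.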