我有HTML元素,如下所示:
我想对h1
,div.article-meta
和div.article-content
进行分组,因此我可以在我的Scrapy项目中逐行循环写入数据。
我想把它们中的每一个分组成var,然后循环var,我不确定如何去做。
请建议。谢谢,
到目前为止,我已经尝试过这个:
def parse(self, response):
now = time.strftime('%Y-%m-%d %H:%M:%S')
hxs = scrapy.Selector(response)
titles = hxs.xpath('//div[@class="list-article"]/h1')
images = hxs.xpath('//div[@class="list-article"]/feature-image')
contents = hxs.xpath('//div[@class="list-article"]/article-content')
for i, title in titles:
item = DapnewsItem()
item['categoryId'] = '1'
name = titles[i].xpath('a/text()')
if not name:
print('DAP => [' + now + '] No title')
else:
item['name'] = name.extract()[0]
description = contents[i].xpath('p/text()')
if not description:
print('DAP => [' + now + '] No description')
else:
item['description'] = description[1].extract()
url = titles[i].xpath("a/@href")
if not url:
print('DAP => [' + now + '] No url')
else:
item['url'] = url.extract()[0]
imageUrl = images[i].xpath('img/@src')
if not imageUrl:
print('DAP => [' + now + '] No imageUrl')
else:
item['imageUrl'] = imageUrl.extract()[0]
yield item
这是我得到的错误。
答案 0 :(得分:1)
让我们使用这个HTML代码段来说明:
<div class="list-article">
<h1><a href="http//www.example.com/article1.html">Title 1</h1>
<div class="article-meta">Something for 1</div>
<div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
<div class="article-content"><p>Content 1</p></div>
<h1><a href="http//www.example.com/article2.html">Title 2</h1>
<div class="article-meta">Something for 2</div>
<div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
<div class="article-content"><p>Content 2</p></div>
<h1><a href="http//www.example.com/article3.html">Title 3</h1>
<div class="article-meta">Something for 3</div>
<div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
<div class="article-content"><p>Content 3</p></div>
</div>
您可以循环访问每个<h1>
并使用XPath's following-sibling
axis检查树中同一级别后面的元素,然后对第一个元素进行过滤:例如:第一个following-sibling::div[@class="feature-image"][1]
<div class="feature-image">
>>> selector = scrapy.Selector(text='''<div class="list-article">
...
... <h1><a href="http//www.example.com/article1.html">Title 1</h1>
... <div class="article-meta">Something for 1</div>
... <div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
... <div class="article-content"><p>Content 1</p></div>
...
... <h1><a href="http//www.example.com/article2.html">Title 2</h1>
... <div class="article-meta">Something for 2</div>
... <div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
... <div class="article-content"><p>Content 2</p></div>
...
... <h1><a href="http//www.example.com/article3.html">Title 3</h1>
... <div class="article-meta">Something for 3</div>
... <div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
... <div class="article-content"><p>Content 3</p></div>
...
... </div>''')
>>> for h in selector.css('div.list-article > h1'):
... item = {
... 'title': h.xpath('a/text()').extract_first(),
... 'image': h.xpath('''
... following-sibling::div[@class="feature-image"][1]
... /img/@src''').extract_first(),
... 'content': h.xpath('''
... following-sibling::div[@class="article-content"][1]
... /p/text()''').extract_first(),
... }
... print(item)
...
{'content': u'Content 1', 'image': u'http://www.example.com/image1.jpg', 'title': u'Title 1'}
{'content': u'Content 2', 'image': u'http://www.example.com/image2.jpg', 'title': u'Title 2'}
{'content': u'Content 3', 'image': u'http://www.example.com/image3.jpg', 'title': u'Title 3'}
>>>