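I am having trouble iterating over a crawl with Scrapy. I am extracting a title field and a content field. The problem is that I get a JSON file that lists all of the titles and then all of the content. I would like to get {title}, {content}, {title}, {content}, which means I probably need to loop inside the parse function. What I cannot figure out is which element I should loop over (i.e. for x in [???]). Here is the code: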
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import SitemapSpider
from Foo.items import FooItem

class FooSpider(SitemapSpider):
    name = "foo"
    sitemap_urls = ['http://www.foo.com/sitemap.xml']
    #sitemap_rules = [

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = FooItem()
        item['title'] = hxs.select('//span[@class="headline"]/text()').extract()
        item['content'] = hxs.select('//div[@class="articletext"]/text()').extract()
        items.append(item)
        return items
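Answer 0 (score: 2)
Your XPath queries return all of the titles and all of the content on the page, so a single item ends up holding both complete lists. I suppose you could do this: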
titles = hxs.select('//span[@class="headline"]/text()').extract()
contents = hxs.select('//div[@class="articletext"]/text()').extract()
# Pair the n-th title with the n-th content and emit one item per pair.
for title, content in zip(titles, contents):
    item = FooItem()
    item['title'] = title
    item['content'] = content
    yield item
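But this is not reliable: zip pairs the two lists purely by position, so a missing headline or an article body split into several text nodes shifts every pair after it. Instead, try an XPath query that returns the blocks containing both title and content, then query inside each block. If you show me the XML source, I can help you.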
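# "some_filter" is a placeholder for whatever class wraps one article.
blocks = hxs.select('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    # Relative paths (no leading //) search only inside the current block.
    item['title'] = block.select('span[@class="headline"]/text()').extract()
    item['content'] = block.select('div[@class="articletext"]/text()').extract()
    yield item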
I am not sure about the exact XPath queries, but I think the idea is clear.
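The difference between the two approaches can be checked without Scrapy. The sketch below is only an illustration under assumptions, not the asker's page: it uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) on a made-up, well-formed snippet, and the class name "article" is hypothetical. The second article deliberately has no headline.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: the second article has no headline.
PAGE = """
<html><body>
  <div class="article">
    <span class="headline">First title</span>
    <div class="articletext">First body</div>
  </div>
  <div class="article">
    <div class="articletext">Second body</div>
  </div>
  <div class="article">
    <span class="headline">Third title</span>
    <div class="articletext">Third body</div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Page-wide queries followed by zip(): pairs are formed by position only.
titles = [e.text for e in root.findall(".//span[@class='headline']")]
contents = [e.text for e in root.findall(".//div[@class='articletext']")]
zipped = list(zip(titles, contents))

# Block-wise queries: each title is looked up inside its own article,
# so a missing headline shows up as None instead of shifting the pairs.
by_block = [
    {
        "title": block.findtext("span[@class='headline']"),
        "content": block.findtext("div[@class='articletext']"),
    }
    for block in root.findall(".//div[@class='article']")
]

print(zipped)
print(by_block)
```

With this input, zip pairs "Third title" with "Second body" and drops the third body, while the block-wise version keeps three records and leaves the missing title empty.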
Answer 1 (score: 0)
You do not need HtmlXPathSelector. Scrapy responses already have XPath selectors built in. Try this:
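# "some_filter" is a placeholder for whatever class wraps one article.
blocks = response.xpath('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    # [0] takes the first match; it raises IndexError if nothing matches.
    item['title'] = block.xpath('span[@class="headline"]/text()').extract()[0]
    item['content'] = block.xpath('div[@class="articletext"]/text()').extract()[0]
    yield item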
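One caveat with .extract()[0]: indexing raises IndexError when a block has no match. A minimal sketch of a guard follows, assuming a hypothetical helper named first_or_default (it is not part of Scrapy), with plain lists standing in for the result of .extract():

```python
def first_or_default(values, default=None):
    """Return the first element of values, or default when values is empty."""
    return values[0] if values else default


# Stand-ins for what .extract() returns (hypothetical data).
matched = ["A headline"]   # a block that contains a headline
unmatched = []             # a block without one

title = first_or_default(matched)
missing = first_or_default(unmatched, default="")

print(repr(title), repr(missing))
```

In a spider this would be used as item['title'] = first_or_default(block.xpath(...).extract()). Newer Scrapy releases also provide extract_first() and get() on selector lists, which return None when nothing matches.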