I am using Scrapy to scrape information from the first page of a website, and I am exporting the data to a .csv file like so:

```
scrapy crawl spidername -o data.csv
```
I would like output of the form:

```
{'Title': [u'Message'],
 'Link': [u'url'],
 'Text': [u'Hello World']}
{...........
.....} etc
```
But instead I get everything in a single `{}`, i.e.:

```
{[all 'Title' data],
 [all 'Link' data],
 [all 'Text' data]}
```
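The difference between the two shapes can be sketched in plain Python (the row data here is hypothetical, just to illustrate): extracting per row yields one item per company box, while extracting against the whole page collects every match into one item of lists.

```python
# Hypothetical page data: each dict stands in for one "company box".
rows = [
    {"title": "Message", "link": "url1"},
    {"title": "Other", "link": "url2"},
]

# Per-row extraction (the desired output): one item per box.
per_item = [{"Title": r["title"], "Link": r["link"]} for r in rows]

# Page-wide extraction (what the spider is currently doing):
# all titles and all links end up inside a single item.
aggregated = {
    "Title": [r["title"] for r in rows],
    "Link": [r["link"] for r in rows],
}

print(per_item)    # two separate items
print(aggregated)  # one item holding lists
```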
My Scrapy spider is as follows (I have fixed the instantiation to match the imported class name, `MyspiderprojectItem`):

```python
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.http import Request
from MySpiderProject.items import MyspiderprojectItem


class MySpiderProjectSpider(CrawlSpider):
    name = "scrapylist"
    allowed_domains = ["url"]
    start_urls = [
        "url/companies/"
    ]

    def parse(self, response):
        for sel in response.xpath('xpath containing each data item'):
            item = MyspiderprojectItem()
            item['Title'] = sel.xpath('xpath for title').extract()
            item['Link'] = sel.xpath('xpath for link').extract()
            item['Text'] = sel.xpath('xpath for text').re('[^\t\n]+')
            yield item
```
The URL I am scraping is http://scrapy.org/companies/ and the XPath expressions are:

```python
response.xpath('//div[@class="company-box"]')
response.xpath('//div[@class="companies-container"]')
response.xpath('//p/span[@class="highlight"]/text()').extract()
response.xpath('//a/@href').extract()
response.xpath('//p//text()').re('[^\t\n]+')
```
As far as I can tell, each of them produces the correct output on its own.
Can someone explain what is going wrong here?
Answer (score: 1)
You are using absolute XPaths, which match nodes anywhere in the whole document, not just the descendants of your selector.
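The scoping issue can be seen without Scrapy at all; the standard library's `xml.etree.ElementTree` shows the same behavior (a minimal sketch with made-up markup; the element names and `href` values are hypothetical):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<div class='company-box'><a href='u1'/><p>one</p></div>"
    "<div class='company-box'><a href='u2'/><p>two</p></div>"
    "</body></html>"
)

boxes = doc.findall(".//div[@class='company-box']")
first = boxes[0]

# Searching from the document root matches every <a> on the page...
all_links = [a.get("href") for a in doc.findall(".//a")]
# ...while searching from one box only matches that box's descendants.
box_links = [a.get("href") for a in first.findall(".//a")]

print(all_links)  # ['u1', 'u2']
print(box_links)  # ['u1']
```

The same distinction applies inside the spider's loop: an expression starting with `//` ignores the `sel` it is called on, while `.//` stays within it.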
I tried to replicate your project, and this version generates the desired CSV file:

```python
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.http import Request
from MySpiderProject.items import MyspiderprojectItem


class MySpiderProjectSpider(CrawlSpider):
    name = "scrapylist"
    start_urls = [
        "http://scrapy.org/companies/"
    ]

    def parse(self, response):
        for sel in response.css(".company-box"):
            item = MyspiderprojectItem()
            item['Title'] = sel.css(".highlight ::text").extract_first()
            item['Link'] = sel.css('a::attr(href)').extract_first()
            item['Text'] = sel.xpath('.//p//text()').re('[^\t\n]+')
            yield item
```
I replaced the XPath expressions with CSS selectors, since they seem easier to work with, and they are also more robust when querying for a class. I ran it with:

```
$ scrapy crawl scrapylist -o data.csv
```