Scrapy dmoz tutorial: no desc data in the CSV file

Time: 2015-10-07 16:11:13

Tags: python web-scraping scrapy

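I am following the dmoz tutorial on the official Scrapy website to scrape the titles, links and descriptions of Python books and resources. I am using exactly the same spider as the tutorial: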

import scrapy 
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

It runs fine, and if I replace yield with print, the data is printed to the console.

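The problem appears when I try to store the scraped data in a CSV file with the following command: scrapy crawl dmoz -o items.csv -t csv. The newly created CSV file only contains data for title and link, while the desc column is empty. Can someone tell me why?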

1 Answer:

Answer 0: (score: 2)

There are multiple problems here.

First, the //ul/li locator is not the best choice here, because it also matches the top-level menu and submenu items, which have no description.
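The difference between the broad and the narrow locator can be sketched with the standard library alone. The markup below is hypothetical and heavily simplified (the real dmoz page is more complex), and `describe` is an illustrative helper, not part of Scrapy. Note that `xml.etree.ElementTree` exposes the text after the link as `a.tail` rather than through an XPath `text()` step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the actual dmoz page.
PAGE = """
<html><body>
  <ul class="menu">
    <li><a href="/about">About</a></li>
  </ul>
  <ul class="directory-url">
    <li><a href="http://example.com/book">Some Book</a> - A description. </li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(PAGE)


def describe(li):
    """Return (title, description) for one <li>; description may be empty."""
    link = li.find("a")
    desc = (link.tail or "").strip()  # text that follows the <a> element
    return link.text, desc


# Broad locator: also picks up the menu entry, which has no description.
broad = [describe(li) for li in root.findall(".//ul/li")]
# Narrow locator: only the listing entries.
narrow = [describe(li) for li in root.findall(".//ul[@class='directory-url']/li")]

print(broad)
print(narrow)
```

With the broad locator the menu entry produces an item whose description is empty, while the narrow locator yields only the real listing.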

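Also, the descriptions are retrieved with all the surrounding whitespace and newlines, which you need to trim to get clean results. The most idiomatic Scrapy way to do that is to use Item Loaders with input and output processors.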
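As a rough illustration of what a strip-then-join processor pair does to one field, here is a plain-Python sketch. The helpers `strip_each` and `join_values` are hypothetical stand-ins that only mimic the behaviour of `MapCompose(str.strip)` and `Join()`; the sample text nodes are assumed, not taken from the real page.

```python
def strip_each(values):
    """Mimic an input processor that strips every extracted string."""
    return [v.strip() for v in values]


def join_values(values, separator=" "):
    """Mimic an output processor that joins the values into one string."""
    return separator.join(values)


# Assumed text nodes for one <li>: whitespace before the link,
# then the description that follows it.
raw_desc = ["\r\n\t", " - By David Mertz; Addison Wesley.\r\n"]

clean_desc = join_values(strip_each(raw_desc))
print(repr(clean_desc))
```

The whitespace-only node becomes an empty string, and joining it with the description leaves a single leading space, which is why a description can still begin with " - " after stripping.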

Complete code:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozItemLoader(ItemLoader):
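    # Strip whitespace from every extracted string. On Python 3 use str.strip;
    # on Python 2 (as in the original answer) this was unicode.strip.
    default_input_processor = MapCompose(str.strip)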
    default_output_processor = Join()

    default_item_class = DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="directory-url"]/li'):
            loader = DmozItemLoader(selector=sel)

            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')

            yield loader.load_item()

Run it with:

$ scrapy runspider myspider.py -o items.csv -t csv

Here is what ends up in items.csv:

title,link,desc
Core Python Programming,"http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html"," - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall] "
Data Structures and Algorithms with Object-Oriented Design Patterns in Python,http://www.brpreiss.com/books/opus7/html/book.html," - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. "
...
Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython,http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1," - By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website] "
Python: Visual QuickStart Guide,"http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html"," - By Chris Fehily; Peachpit Press, 2002, ISBN 0201748843. Task-based, step-by-step visual reference guide, many screen shots, for courses in digital graphics; Web design, scripting, development; multimedia, page layout, office tools, operating systems. [Prentice Hall] "
Sams Teach Yourself Python in 24 Hours,http://www.informit.com/store/product.aspx?isbn=0672317354," - By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing] "
Text Processing in Python,http://gnosis.cx/TPiP/," - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.] "
XML Processing with Python,http://www.informit.com/store/product.aspx?isbn=0130211192," - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR] "
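One row in the output above spans two physical lines because its desc contains a newline inside the quoted field. A CSV reader still treats it as a single cell, which can be checked with the standard `csv` module. The sample below is abbreviated from the output above for illustration only.

```python
import csv
import io

# Two rows abbreviated from the output above; the second desc
# contains an embedded newline inside the quoted field.
SAMPLE = (
    'title,link,desc\n'
    'Text Processing in Python,http://gnosis.cx/TPiP/,'
    '" - By David Mertz; Addison Wesley. "\n'
    'Data Structures and Algorithms,'
    'http://www.brpreiss.com/books/opus7/html/book.html,'
    '" - The primary goal.\nA secondary goal. "\n'
)

# newline="" lets the csv module handle line endings inside quoted fields.
rows = list(csv.DictReader(io.StringIO(SAMPLE, newline="")))

for row in rows:
    print(repr(row["title"]), repr(row["desc"].strip()))
```

The reader returns two records, and the second record's desc keeps its internal newline.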