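I followed the dmoz tutorial on the official Scrapy website to scrape the titles, links, and descriptions of Python books and resources. I used exactly the same spider as in the tutorial: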
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
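It runs fine, and if I replace yield with print, the data is printed to the console.
The problem appears when I try to store the scraped data in a CSV file with the following command:
scrapy crawl dmoz -o items.csv -t csv
The newly created CSV file only contains data for title and link, while the desc column is empty. Can someone tell me why?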
Answer 0 (score: 2)
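There are multiple problems here.
First, the //ul/li locator is not the best choice in this case, because it also matches the top-level menu and submenus, which have no descriptions.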
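As a rough illustration of the locator problem, here is a sketch that uses only the standard library rather than Scrapy. The page markup is hypothetical and heavily simplified; only the class name directory-url comes from this answer. It shows that the broad path also picks up a menu entry whose description is empty, while the narrower path does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: one navigation menu and one listing.
# Only the class name "directory-url" is taken from the answer.
PAGE = """
<html><body>
  <ul class="menu">
    <li><a href="/about">About</a></li>
  </ul>
  <ul class="directory-url">
    <li><a href="http://example.com/book">Some Book</a> - A short description.</li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Broad locator: every <li> under any <ul>, menus included.
broad = root.findall(".//ul/li")

# Narrow locator: only the <li> elements of the listing.
narrow = root.findall(".//ul[@class='directory-url']/li")

for li in broad:
    link = li.find("a")
    # ElementTree stores the text that follows a child element in .tail
    desc = (link.tail or "").strip()
    print(repr(link.text), repr(desc))
```

With the broad locator, the menu entry yields an item with an empty description; the narrow locator returns only the listing entry.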
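In addition, the descriptions are retrieved with all the extra whitespace and newlines, which you need to trim to get clean results. The most idiomatic Scrapy way to do this is to use Item Loaders with input and output processors.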
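To make the idea of the two processors concrete, here is a plain-Python sketch of the strip-then-join step. It is a stand-in, not Scrapy's implementation: clean_desc and its drop_empty flag are hypothetical names, and the raw fragments are made-up examples of what sel.xpath('text()').extract() might return for one list entry.

```python
def clean_desc(fragments, drop_empty=False):
    """Strip each text fragment, then join the results with single spaces.

    With drop_empty=True, fragments that were whitespace only are discarded
    before joining, which avoids stray leading or trailing spaces.
    """
    stripped = [fragment.strip() for fragment in fragments]
    if drop_empty:
        stripped = [fragment for fragment in stripped if fragment]
    return " ".join(stripped)


# Hypothetical raw text nodes of one <li>: whitespace before the link,
# then the description followed by a line break.
raw = ["\r\n\t", " - By Some Author; Some Publisher, 2001.\r\n"]

print(repr(clean_desc(raw)))
print(repr(clean_desc(raw, drop_empty=True)))
```

Because a whitespace-only fragment becomes an empty string that still takes part in the join, the first call keeps a leading space; dropping empty fragments first is one way to avoid that, if it matters for the output.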
Complete code:
import scrapy
from scrapy.loader import ItemLoader
# Newer Scrapy releases expose these processors from itemloaders.processors.
from scrapy.loader.processors import MapCompose, Join


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozItemLoader(ItemLoader):
    # Strip whitespace from every extracted string (on Python 2, use unicode.strip).
    default_input_processor = MapCompose(str.strip)
    # Join the stripped fragments of each field into a single string.
    default_output_processor = Join()
    default_item_class = DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Only match the listing entries, not the menus.
        for sel in response.xpath('//ul[@class="directory-url"]/li'):
            loader = DmozItemLoader(selector=sel)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()
After running
$ scrapy runspider myspider.py -o items.csv -t csv
this is the content of items.csv:
title,link,desc
Core Python Programming,"http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html"," - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall] "
Data Structures and Algorithms with Object-Oriented Design Patterns in Python,http://www.brpreiss.com/books/opus7/html/book.html," - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. "
...
Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython,http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1," - By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website] "
Python: Visual QuickStart Guide,"http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html"," - By Chris Fehily; Peachpit Press, 2002, ISBN 0201748843. Task-based, step-by-step visual reference guide, many screen shots, for courses in digital graphics; Web design, scripting, development; multimedia, page layout, office tools, operating systems. [Prentice Hall] "
Sams Teach Yourself Python in 24 Hours,http://www.informit.com/store/product.aspx?isbn=0672317354," - By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing] "
Text Processing in Python,http://gnosis.cx/TPiP/," - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.] "
XML Processing with Python,http://www.informit.com/store/product.aspx?isbn=0130211192," - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR] "
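Some of the desc values above span more than one line inside their quotes, which can make the file hard to judge by eye. A small sketch for checking the column programmatically with the standard csv module follows; the sample data is hypothetical and only mimics the shape of items.csv.

```python
import csv
import io

# Hypothetical sample in the same shape as items.csv; the third field
# of the data row contains an embedded newline inside the quotes.
SAMPLE = (
    "title,link,desc\r\n"
    'Some Book,http://example.com/book," - First line.\nSecond line. "\r\n'
)

# newline="" leaves line endings untouched so the csv module can handle
# quoted fields that contain line breaks.
reader = csv.DictReader(io.StringIO(SAMPLE, newline=""))
rows = list(reader)

for row in rows:
    print(row["title"], "->", repr(row["desc"]))
```

To run the same check on the real file, the StringIO object could be replaced with open("items.csv", newline="").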