I've read through the official Scrapy tutorial, but it's not clear to me whether I can use an external library to do the article extraction.
Answer (score: 2)
Sure you can. =)
Here's an example spider to get you started:
import scrapy
from goose import Goose


class Article(scrapy.Item):
    title = scrapy.Field()
    text = scrapy.Field()


class MyGooseSpider(scrapy.Spider):
    name = 'goose'
    start_urls = [
        'http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/',
        'http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/',
    ]

    def parse(self, response):
        # Hand the downloaded HTML to Goose, which does the article extraction.
        article = Goose().extract(raw_html=response.body)
        yield Article(title=article.title, text=article.cleaned_text)
Put it in file.py and run:
scrapy runspider file.py -o output.json
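Scrapy writes the items the spider yields to output.json, so you end up with a JSON list of title/text records.

If you'd rather discover the article links while crawling instead of listing URLs by hand, here is a rough sketch of how the same Goose call fits into a spider that follows links. The start URL and the blanket "follow every link" selector are placeholders for illustration, and response.urljoin assumes Scrapy 1.0 or newer; adapt both to the site you're scraping.

import scrapy
from goose import Goose


class Article(scrapy.Item):
    title = scrapy.Field()
    text = scrapy.Field()
    url = scrapy.Field()


class GooseFollowSpider(scrapy.Spider):
    name = 'goose-follow'
    # Placeholder starting point; point this at the listing page you actually want to crawl.
    start_urls = ['http://blog.scrapinghub.com/']

    def parse(self, response):
        # Follow every link on the page; in practice you'd narrow this selector down.
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        # Same Goose extraction as in the spider above, applied to each followed page.
        article = Goose().extract(raw_html=response.body)
        yield Article(title=article.title, text=article.cleaned_text, url=response.url)

(Goose here is the python-goose package; if you don't have it yet, it's on PyPI as goose-extractor, if I recall the package name correctly.)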