I just want to scrape the titles and links of the latest articles on Hacker News.
Here is my code:
    import scrapy
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = "hn"
        allowed_domains = ["https://news.ycombinator.com"]
        start_urls = ["https://news.ycombinator.com/"]

        def parse(self, response):
            item = HnItem()
            item['title'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/text()').extract()
            item['link'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/@href').extract()
            print item['title']
            print item['link']
But this returns an empty list.
P.S. I am a beginner with Python and Scrapy.
Answer 0 (score: 0)
This is what I ended up with when I tried writing the spider:
    import scrapy

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = 'hackernews'
        allowed_domains = ['news.ycombinator.com']  # see Javier's comment
        start_urls = ['http://news.ycombinator.com/']

        def parse(self, response):
            sel = scrapy.Selector(response)
            item = HnItem()
            # These XPaths can probably be made more generic.
            # Each call returns a list with one entry per matching story row.
            item['title'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()
            item['link'] = sel.xpath("//tr[@class='athing']/td[3]/a/@href").extract()
            # Do whatever you want with the item: print it, return it, etc.
            print(item['title'])
            print(item['link'])
You can run this from the command line with: scrapy runspider path/to/your_spider.py
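A likely reason the original XPath came back empty is the `tbody` steps: browsers usually insert `<tbody>` into tables when they build the DOM, so a path copied from the browser inspector may not match the raw HTML that the spider downloads. Below is a minimal sketch of that effect. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selector so that it runs without Scrapy installed, and `RAW_HTML` is a made-up, simplified snippet, not the real Hacker News markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page as the server sends it.
# Note that there is no <tbody> element inside the table.
RAW_HTML = """
<html><body>
<table id="hnmain">
  <tr class="athing">
    <td>1.</td><td></td>
    <td><a href="https://example.com/a">First story</a></td>
  </tr>
  <tr class="athing">
    <td>2.</td><td></td>
    <td><a href="https://example.com/b">Second story</a></td>
  </tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# A browser-style path that expects <tbody> finds nothing in the raw markup.
with_tbody = root.findall(".//table[@id='hnmain']/tbody/tr/td[3]/a")

# Matching on the row class does not depend on <tbody>.
# Walking row by row also keeps each title paired with its own link.
stories = []
for row in root.findall(".//tr[@class='athing']"):
    anchor = row.find("td[3]/a")
    if anchor is not None:
        stories.append({"title": anchor.text, "link": anchor.get("href")})

print(with_tbody)
print(stories)
```

The same row-by-row idea should carry over to the spider: select the `athing` rows first, then run a relative XPath on each row, which gives one title and link pair per story instead of two separate lists.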