Here's my code. Can anyone help? For some reason the spider runs but doesn't actually scrape any forum posts. I'm trying to extract all the text from the forum posts of a specific forum in my start URL.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from xbox.items import xboxItem
from scrapy.item import Item
from scrapy.conf import settings

class xboxSpider(CrawlSpider):
    name = "xbox"
    allowed_domains = ["forums.xbox.com"]
    start_urls = [
        "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/default.aspx",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['/t/\d+']), callback='parse_thread'),
        Rule(SgmlLinkExtractor(allow=('/t/new\?new_start=\d+',))),
    ]

    def parse_thread(self, response):
        hxs = HtmlXPathSelector(response)
        item = xboxItem()
        item['content'] = hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
        item['date'] = hxs.select("//span[@class='value']/text()").extract()
        return item
Log output:
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-13 11:22:18-0400 [xbox] INFO: Spider opened
2013-03-13 11:22:18-0400 [xbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-13 11:22:20-0400 [xbox] DEBUG: Crawled (200) <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; (referer: None)
2013-03-13 11:22:20-0400 [xbox] DEBUG: Filtered offsite request to 'forums.xbox.com': <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…;
2013-03-13 11:22:20-0400 [xbox] INFO: Closing spider (finished)
2013-03-13 11:22:20-0400 [xbox] INFO: Dumping spider stats
Answer (score: 1)
As a first tweak, you need to modify your first rule by adding a "." at the start of the regex, as shown below. I've also changed the start URL to the forum's actual front page.
start_urls = [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
]
rules = (
    Rule(SgmlLinkExtractor(allow=('./t/\d+',)), callback="parse_thread", follow=True),
    # Note: "." and "?" are regex metacharacters, so they need escaping here,
    # otherwise "aspx?PageIndex" never matches the literal "aspx?PageIndex=..." in the URL.
    Rule(SgmlLinkExtractor(allow=('./310\.aspx\?PageIndex=\d+',))),
)
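Whatever patterns you settle on, keep in mind that the `allow` patterns are plain regexes searched against the absolute URL, so you can sanity-check them with `re` before re-running the whole crawl. A quick sketch; the sample URLs below are made up to illustrate the two URL shapes the rules target, not taken from the real site:

```python
import re

# Hypothetical URLs of the two shapes the rules target (illustrative only;
# check the real URLs in your crawl log or browser).
thread_url = "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310/t/123456.aspx"
page_url = "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx?PageIndex=2"

# The link extractor applies each allow pattern as a regex search,
# so re.search reproduces its matching behaviour.
print(bool(re.search(r'./t/\d+', thread_url)))                   # True
print(bool(re.search(r'./310\.aspx\?PageIndex=\d+', page_url)))  # True
```

This is also an easy way to see the escaping problem: the unescaped pattern `./310.aspx?PageIndex=\d+` fails against `page_url`, because the `?` makes the `x` optional instead of matching the literal `?` in the query string.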
I've updated the rules so that the spider now crawls all the pages in a thread.
Edit: I spotted a typo that may be causing the problem, and I've also fixed the date xpath.

item['content']=hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
item['date']=hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
The first line above says "hxs.selec" where it should be "hxs.select". After changing that, I can see the content being scraped. Through trial and error (I'm a bit rubbish with xpaths), I've managed to get the date of the first post (i.e. the date the thread was created), so this should all work now.
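As a side note, you can smoke-test the xpath logic without re-running the spider. A rough standard-library sketch follows; the snippet of markup is my guess at the forum's structure based on the class names used above, so verify against the real page (e.g. in `scrapy shell`) before trusting it:

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a thread page (structure inferred from the spider's
# class names; the real forum HTML will differ).
page = """<html><body>
<div class="post-author"><a class="internal-link view-post">03-13-2013</a></div>
<div class="post-content user-defined-markup"><p>First post text</p></div>
</body></html>"""

root = ET.fromstring(page)

# ElementTree supports a limited XPath subset, enough for these queries.
content = [p.text for p in
           root.findall(".//div[@class='post-content user-defined-markup']/p")]
dates = [a.text for a in
         root.findall(".//div[@class='post-author']/a[@class='internal-link view-post']")]

print(content)  # ['First post text']
print(dates)    # ['03-13-2013']
```

On the real (possibly malformed) HTML you'd use Scrapy's own selector rather than ElementTree, but a toy round-trip like this catches class-name typos and path mistakes quickly.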