Problem statement:
I have a list of forum URLs in a file myurls.csv, one URL per line, like this:
https://www.drupal.org/user/3178461/track
https://www.drupal.org/user/511008/track
I wrote a CrawlSpider to scrape the forum posts, shown below:
class fileuserurl(CrawlSpider):
    name = "fileuserurl"
    allowed_domains = []
    start_urls = []
    rules = (
        Rule(SgmlLinkExtractor(allow=('/user/\d/track'),
                               restrict_xpaths=('//li[@class="pager-next"]',),
                               canonicalize=False),
             callback='parse_page', follow=True)
    )

    def __init__(self):
        f = open('./myurls.txt', 'r').readlines()
        self.allowed_domains = ['www.drupal.org']
        self.start_urls = [l.strip() for l in f]
        super(fileuserurl, self).__init__()
    def parse_page(self, response):
        print '*********** START PARSE_PAGE METHOD**************'
        # print response.url
        items = response.xpath("//tbody/tr")
        myposts = []
        for temp in items:
            item = TopicPosts()
            item['topic'] = temp.xpath(".//td[2]/a/text()").extract()
            relative_url = temp.xpath(".//td[2]/a/@href").extract()[0]
            item['topiclink'] = 'https://www.drupal.org' + relative_url
            item['author'] = temp.xpath(".//td[3]/a/text()").extract()
            try:
                item['replies'] = str(temp.xpath(".//td[4]/text()").extract()[0]).strip('\n')
            except:
                continue
            myposts.append(item)
        return myposts
Question:
It only gives me output for the first page of each URL listed in the text file. I want it to follow every paginated link defined by the "next page" pager on the front page.
Answer 0 (score: 0)
Instead, define a start_requests() method:
def start_requests(self):
    with open('./myurls.txt', 'r') as f:
        for url in f:
            url = url.strip()
            yield scrapy.Request(url)
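As a side note, the .strip() call matters because each line read from a file keeps its trailing newline, which would produce invalid request URLs. A minimal sketch of just that step, using StringIO to stand in for the real myurls.txt file:

```python
from io import StringIO

# StringIO simulates the contents of myurls.txt (an assumption for this sketch)
fake_file = StringIO(
    "https://www.drupal.org/user/3178461/track\n"
    "https://www.drupal.org/user/511008/track\n"
)

# Without .strip(), each URL would carry a trailing "\n"
urls = [line.strip() for line in fake_file]
```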
Also, you need to define rules as an iterable. In addition, the regular expression in allow should permit multiple digits (\d+ instead of \d):
rules = [
    Rule(SgmlLinkExtractor(allow='/user/\d+/track',
                           restrict_xpaths='//li[@class="pager-next"]',
                           canonicalize=False),
         callback='parse_page',
         follow=True)
]