CrawlSpider is not crawling the URLs from the text file

Date: 2015-03-11 07:03:12

Tags: python web-scraping scrapy forum

Problem statement:

I have a list of forum URLs, one per line, in a file named myurls.txt, like this:

https://www.drupal.org/user/3178461/track
https://www.drupal.org/user/511008/track

I wrote a CrawlSpider to scrape the forum posts, as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from myproject.items import TopicPosts  # assumed project layout; TopicPosts is the item class from items.py


class fileuserurl(CrawlSpider):
    name = "fileuserurl"
    allowed_domains = []
    start_urls = []

    rules = (
        Rule(SgmlLinkExtractor(allow=('/user/\d/track'),
                               restrict_xpaths=('//li[@class="pager-next"]',),
                               canonicalize=False),
             callback='parse_page', follow=True)
    )

    def __init__(self):
        # Read the start URLs from the file at instantiation time.
        f = open('./myurls.txt', 'r').readlines()
        self.allowed_domains = ['www.drupal.org']
        self.start_urls = [l.strip() for l in f]
        super(fileuserurl, self).__init__()

    def parse_page(self, response):
        print '*********** START PARSE_PAGE METHOD**************'
        # print response.url
        items = response.xpath("//tbody/tr")
        myposts = []
        for temp in items:
            item = TopicPosts()
            item['topic'] = temp.xpath(".//td[2]/a/text()").extract()
            relative_url = temp.xpath(".//td[2]/a/@href").extract()[0]
            item['topiclink'] = 'https://www.drupal.org' + relative_url
            item['author'] = temp.xpath(".//td[3]/a/text()").extract()
            try:
                item['replies'] = str(temp.xpath(".//td[4]/text()").extract()[0]).strip('\n')
            except IndexError:  # row has no replies cell
                continue
            myposts.append(item)
        return myposts

Problem:

It only gives me output for the first page of each URL listed in the text file. I want it to follow every page link reachable through the next-page pager on the first page.

1 answer:

Answer 0 (score: 0)

Instead, define a start_requests() method:

def start_requests(self):
    with open('./myurls.txt','r') as f:
        for url in f:
            url = url.strip()
            yield scrapy.Request(url)
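Since start_requests() completely replaces the default behaviour of reading start_urls, the __init__() override from the question is no longer needed; just add import scrapy at the top of the module so that scrapy.Request is available.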

Also, you need to define rules as an iterable: as written, the parentheses around the single Rule have no trailing comma, so they do not form a tuple. And the regular expression in allow should match more than one digit (\d+ instead of \d), since the user IDs in these URLs are several digits long:

rules = [
    Rule(SgmlLinkExtractor(allow='/user/\d+/track', restrict_xpaths='//li[@class="pager-next"]', canonicalize=False),
         callback='parse_page',
         follow=True)
]
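
Putting the two fixes together, a minimal sketch of the corrected spider could look like this (the imports assume the Scrapy 0.24-era module layout that SgmlLinkExtractor lives in; the parse_page() body is unchanged from the question and elided here):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class fileuserurl(CrawlSpider):
    name = "fileuserurl"
    allowed_domains = ['www.drupal.org']

    # An iterable (list) of rules: follow the "next page" pager link
    # and hand every page it leads to over to parse_page().
    rules = [
        Rule(SgmlLinkExtractor(allow='/user/\d+/track',
                               restrict_xpaths='//li[@class="pager-next"]',
                               canonicalize=False),
             callback='parse_page',
             follow=True),
    ]

    def start_requests(self):
        # One request per non-empty line of the URL file;
        # this replaces start_urls entirely.
        with open('./myurls.txt', 'r') as f:
            for url in f:
                url = url.strip()
                if url:
                    yield scrapy.Request(url)

    def parse_page(self, response):
        # ... same body as in the question ...
        pass

The spider can then be run as usual with scrapy crawl fileuserurl.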