Scrapy iteration problem in Python

Asked: 2015-05-11 02:18:13

Tags: python postgresql python-2.7 scrapy

I'm having some trouble with Scrapy. I'm following the newcoder tutorial and seem to be stuck on the iteration step. The tutorial is here: http://newcoder.io/scrape

I'm trying to scrape: http://freefuninaustin.com/

I can easily get all the titles with: 'title': '//h3[@class="content-list-title"]//@title'

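However, whenever I run the scraper, it grabs all the titles for every post and inserts them into my database. I want it to extract one title per post and insert that into the database.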

The code for the spider itself:

deals_list_xpath = '//article'
item_fields = {
    'title': '//h3[@class="content-list-title"]//@title'
}

def parse(self, response):
    """
    Default callback used by Scrapy to process downloaded responses

    Testing contracts:
    @url http://www.freefuninaustin.com/blog/
    @returns items 1
    @scrapes title 

    """
    selector = HtmlXPathSelector(response)

    # iterate over deals
    for deal in selector.xpath(self.deals_list_xpath):
        loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

        # define processors
        loader.default_input_processor = MapCompose(unicode.strip)
        loader.default_output_processor = Join()

        # iterate over fields and add xpaths to the loader
        for field, xpath in self.item_fields.iteritems():
            loader.add_xpath(field, xpath)
        yield loader.load_item()

And now the pipeline:

def process_item(self, item, spider):
    """Save deals in the database.

    This method is called for every item pipeline component.

    """
    session = self.Session()
    deal = Deals(**item)

    try:
        session.add(deal)
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

    return item

The output from Scrapy:

     loader = XPathItemLoader(LivingSocialDeal(), selector=deal)
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
    {'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
    {'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
    {'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>

This is what it should look like:

2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'1 or 3 Private Golf Lessons'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'Los Angeles Dodgers at Oakland Athletics on August 18'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'Glycolic or Salicylic Glow Facial Peel'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'Boston Red Sox at Oakland Athletics on May 11'}

How can I extract the title only once per post?

1 Answer:

Answer 0: (score: 0)

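Add a . (dot) at the beginning of the XPath expression to make it "context-specific", so that it is evaluated relative to the current article rather than against the whole document: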

item_fields = {
    'title': './/h3[@class="content-list-title"]//@title'
}
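
To see why the leading dot matters, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. The markup is a hypothetical simplification of the page, and the helper titles_under is made up for this illustration. ElementTree rejects absolute paths on an element, so the document-wide search is run against the root to imitate what '//h3...' does when called on an article selector in Scrapy. The join imitates the Join() output processor from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is more complex.
PAGE = """
<html><body>
  <article><h3 class="content-list-title"><a title="Post A" href="#">Post A</a></h3></article>
  <article><h3 class="content-list-title"><a title="Post B" href="#">Post B</a></h3></article>
</body></html>
"""


def titles_under(node):
    """Collect title attributes at or below matching h3 elements under node."""
    found = []
    for h3 in node.findall('.//h3[@class="content-list-title"]'):
        for el in h3.iter():  # the h3 itself plus its descendants
            if el.get("title") is not None:
                found.append(el.get("title"))
    return found


root = ET.fromstring(PAGE)
articles = root.findall(".//article")

# Searching the whole document on every iteration (the effect of '//h3...'):
document_wide = [" ".join(titles_under(root)) for _ in articles]

# Searching only inside the current article (the effect of './/h3...'):
per_article = [" ".join(titles_under(article)) for article in articles]

print(document_wide)  # ['Post A Post B', 'Post A Post B']
print(per_article)    # ['Post A', 'Post B']
```

The first list reproduces the symptom from the question: every item carries all the titles joined together. The second list has one title per article.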

There are also different "types" of article elements on the page; to handle them as well, you need to rewrite the expression as:

.//h3[@class="content-list-title" or @class="cp-title-small"]//@title
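
The effect of matching either class can be sketched with the standard library as well. ElementTree's XPath subset has no "or" operator, so this sketch swaps in a set membership test in Python; in Scrapy the single XPath expression above does the same job directly. The markup, TITLE_CLASSES and article_title are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with two kinds of article, one per class named above.
PAGE = """
<html><body>
  <article><h3 class="content-list-title"><a title="Regular post" href="#">Regular post</a></h3></article>
  <article><h3 class="cp-title-small"><a title="Small post" href="#">Small post</a></h3></article>
</body></html>
"""

TITLE_CLASSES = {"content-list-title", "cp-title-small"}


def article_title(article):
    """Join the title attributes under any h3 whose class is in TITLE_CLASSES."""
    found = []
    for h3 in article.iter("h3"):
        if h3.get("class") in TITLE_CLASSES:
            found.extend(
                el.get("title") for el in h3.iter() if el.get("title") is not None
            )
    return " ".join(found)


root = ET.fromstring(PAGE)
titles = [article_title(article) for article in root.findall(".//article")]
print(titles)  # ['Regular post', 'Small post']
```

Both kinds of article now yield exactly one title each.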