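I'm running into some problems with Scrapy. I'm following the newcoder tutorial and seem to be stuck on iteration. The tutorial is here: http://newcoder.io/scrape
The site I'm trying to scrape: http://freefuninaustin.com/
I can easily get all the titles: 'title': '//h3[@class="content-list-title"]//@title'
However, every time I run the scraper, it grabs all the titles for each post and inserts them into my database. I want it to extract one title per post and insert that into the database.
The code for the spider itself: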
deals_list_xpath = '//article'
item_fields = {
    'title': '//h3[@class="content-list-title"]//@title'
}
def parse(self, response):
    """
    Default callback used by Scrapy to process downloaded responses
    Testing contracts:
    @url http://www.freefuninaustin.com/blog/
    @returns items 1
    @scrapes title
    """
    selector = HtmlXPathSelector(response)
    # iterate over deals
    for deal in selector.xpath(self.deals_list_xpath):
        loader = XPathItemLoader(LivingSocialDeal(), selector=deal)
        # define processors
        loader.default_input_processor = MapCompose(unicode.strip)
        loader.default_output_processor = Join()
        # iterate over fields and add xpaths to the loader
        for field, xpath in self.item_fields.iteritems():
            loader.add_xpath(field, xpath)
        yield loader.load_item()
And here is the pipeline:
def process_item(self, item, spider):
    """Save deals in the database.
    This method is called for every item pipeline component.
    """
    session = self.Session()
    deal = Deals(**item)
    try:
        session.add(deal)
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()
    return item
The output from Scrapy:
loader = XPathItemLoader(LivingSocialDeal(), selector=deal)
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
{'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
{'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
{'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
This is what it should look like:
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'1 or 3 Private Golf Lessons'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'Los Angeles Dodgers at Oakland Athletics on August 18'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'Glycolic or Salicylic Glow Facial Peel'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'Boston Red Sox at Oakland Athletics on May 11'}
How can I extract the title only once per post?
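Answer 0 (score: 0)
Add a . (dot) at the beginning of the XPath expression to make it "context-specific". Without the dot, an expression starting with // searches the whole document, so the loader for each article matches every title on the page: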
item_fields = {
    'title': './/h3[@class="content-list-title"]//@title'
}
There are also different "types" of article elements on the page. To handle them all, rewrite the expression as:
.//h3[@class="content-list-title" or @class="cp-title-small"]//@title
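
To see the difference between a document-wide and a context-relative lookup without running a spider, here is a minimal sketch using Python's standard-library xml.etree.ElementTree instead of Scrapy selectors. The PAGE markup, the titles() helper, and the assumption that the title attribute sits on an <a> inside each <h3> are hypothetical and only for illustration. ElementTree's XPath subset has no "or" operator, so the sketch loops over the two class names rather than using the combined predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the blog listing page.
PAGE = """
<html><body>
  <article><h3 class="content-list-title"><a title="Splash Pads" href="#">x</a></h3></article>
  <article><h3 class="content-list-title"><a title="Favorite Parks" href="#">x</a></h3></article>
  <article><h3 class="cp-title-small"><a title="Art in the Park" href="#">x</a></h3></article>
</body></html>
""".strip()

# The two heading classes mentioned in the answer.
TITLE_CLASSES = ("content-list-title", "cp-title-small")


def titles(node):
    """Return the title attributes found under `node` only (relative lookup)."""
    found = []
    for css_class in TITLE_CLASSES:
        # The leading "." anchors the search at `node`, not at the document root.
        for link in node.findall(".//h3[@class='%s']//a" % css_class):
            found.append(link.get("title"))
    return found


root = ET.fromstring(PAGE)

# Searching from the root mimics what the absolute '//h3...' expression did:
# every title on the page comes back at once.
all_titles = titles(root)

# Searching from each <article> yields exactly that article's title.
per_article = [titles(article) for article in root.iter("article")]

print(all_titles)
print(per_article)
```

In the spider, the same effect comes from the leading dot alone, because the loader is already constructed with selector=deal for each article.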