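I'm experimenting with Python and the Scrapy library. The idea is to crawl a site and save the required fields (news items, in this case) to a database. Unfortunately, it currently saves only one list item instead of several. It doesn't seem to be iterating correctly.
Any help is much appreciated.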
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from scraper_app.items import ListItem


class ListSpider(BaseSpider):
    name = "news_list"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/Default/Section/1"]

    news_items_xpath = '//*[@id="section-news"]/section/ul/li[1]/div'
    item_fields = {'title': './/div/h3',
                   'link': './/div/h3/a',
                   'description': './/div/p/text()',
                   'date': './/div/div[2]'}

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        # iterate over news items
        for news in selector.select(self.news_items_xpath):
            loader = XPathItemLoader(ListItem(), selector=news)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()
HTML:
<div id="section-news" class="block secondary">
  <section class="inner">
    <ul class="thumbs">
      <li>
        <div>
          <div class="img">
            <a href="/Detail/2015/01/14/393107/AntiIsraelism-not-antiSemitism"><img src="http://217.218.67.233/photo/20150114/59b5efd9-3c1c-47b1-a014-4ca0fedadeb6.jpg" alt="uk jews" /><i class="icon-play"></i></a>
          </div>
          <div class="desc">
            <h3 class="title"><a href="/Detail/2015/01/14/393107/AntiIsraelism-not-antiSemitism">‘Anti-Israelism not anti-Semitism’</a></h3>
            <div class="date">Wed Jan 14, 2015 7:27PM</div>
            <p>A new survey which reveals that nearly half of Britons hold anti-Semitic views.</p>
          </div>
        </div>
      </li>
      <li>
        <div>
          <div class="img">
            <a href="/Detail/2015/01/14/393095/Turkey-bans-arms-delivery-reports"><img src="http://217.218.67.233/photo/20150114/2de1eb77-ba2a-49c9-a232-ab4cf82ffc1d.jpg" alt="Syria-militants" /></a>
          </div>
          <div class="desc">
            <h3 class="title"><a href="/Detail/2015/01/14/393095/Turkey-bans-arms-delivery-reports">Turkey bans arms delivery reports</a></h3>
            <div class="date">Wed Jan 14, 2015 7:22PM</div>
            <p>Turkey bans media reports on alleged arms delivery to militants in Syria.</p>
          </div>
        </div>
      </li>
      <li>
        <div>
          <div class="img">
            <a href="/Detail/2015/01/14/393099/Egypt-Israel-gas-imports-possible"><img src="http://217.218.67.233/photo/20150114/c63935fb-8221-43fc-8103-6f49f013cbfd.jpg" alt="Egypt-Israel" /></a>
          </div>
          <div class="desc">
            <h3 class="title"><a href="/Detail/2015/01/14/393099/Egypt-Israel-gas-imports-possible">Egypt: Israel gas imports possible</a></h3>
            <div class="date">Wed Jan 14, 2015 7:11PM</div>
            <p>Egypt says importing gas from Israel is a possibility.</p>
          </div>
        </div>
      </li>
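Answer 0 (score: 0)
The problem is that your XPath is restricted to a single list entry. The positional predicate [1] matches only the first li under the ul, so the for loop in parse runs exactly once: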
news_items_xpath = '//*[@id="section-news"]/section/ul/li[1]/div'
Remove the [1] so that every li is matched:
news_items_xpath = '//*[@id="section-news"]/section/ul/li/div'
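
To see the difference without running the spider, here is a minimal sketch that uses only Python's standard-library xml.etree.ElementTree rather than Scrapy. The markup is a simplified, well-formed stand-in for the HTML in the question (titles only, wrapped in a body element), and the titles helper is a hypothetical name used for illustration. ElementTree does not accept absolute paths on an element, so the expressions start with .// instead of //.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the question's HTML (assumption: trimmed to titles
# only and wrapped in <body> so the search can start below the root element).
HTML = """
<body>
  <div id="section-news" class="block secondary">
    <section class="inner">
      <ul class="thumbs">
        <li><div><div class="desc"><h3 class="title"><a href="/a">First</a></h3></div></div></li>
        <li><div><div class="desc"><h3 class="title"><a href="/b">Second</a></h3></div></div></li>
        <li><div><div class="desc"><h3 class="title"><a href="/c">Third</a></h3></div></div></li>
      </ul>
    </section>
  </div>
</body>
"""

root = ET.fromstring(HTML)


def titles(xpath):
    """Return the link text of every news item matched by xpath (hypothetical helper)."""
    return [item.find(".//h3/a").text for item in root.findall(xpath)]


# li[1] keeps only the first <li> of the list, so a loop over it runs once.
only_first = titles(".//*[@id='section-news']/section/ul/li[1]/div")

# Without the positional predicate, every <li> is matched.
all_items = titles(".//*[@id='section-news']/section/ul/li/div")

print(only_first)
print(all_items)
```

The same reasoning should carry over to the Scrapy selector in the spider: the item XPath decides how many times the loop body runs, and the field XPaths are then evaluated relative to each matched div.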