Scrapy - only the last result

Time: 2016-04-23 21:18:36

Tags: python scrapy

I've almost got this scrapy program working, except for one last problem. I am trying to:

  1. Iterate over the list of entries on a page
  2. Pull one piece of data (['RStation']) from the first list page for each entry
  3. Follow each entry's URL via its href
  4. Pull some data by iterating over the list on the following page
  5. Create a single item with data from both the main page and the following page

The problem is that when I open my csv, I only see duplicates of the last entry of the second iterated list (once for each entry of the first list).

Am I appending the items incorrectly, or misusing response.meta somehow? I tried to follow the documentation for response.meta, but I can't figure out why this isn't working.

Any help is greatly appreciated.

    import scrapy
    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse
    from fspeople.items import FspeopleItem
    
    class FSSpider(scrapy.Spider):
        name = "fspeople"
        allowed_domains = ["fs.fed.us"]
        start_urls = [
            "http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=SRS&state_id=ALL",
            #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=RMRS&state_id=ALL",
            #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=PSW&state_id=ALL",
            #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=PNW&state_id=ALL",
            #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=NRS&state_id=ALL",
            #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=IITF&state_id=ALL",
            #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=FPL&state_id=ALL",
            #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=WO&state_id=ALL"
        ]

        def __init__(self):
            self.i = 0

        def parse(self, response):
            for sel in response.xpath("//a[@title='Click to view their profile ...']/@href"):
                item = FspeopleItem()
                url = response.urljoin(sel.extract())
                item['RStation'] = response.xpath("//table[@id='table_id']/tbody/tr/td[2]/i/b/text() | //table[@id='table_id']/tbody/td[2]/text()").extract_first().strip()
                request = scrapy.Request(url, callback=self.parse_post)
                request.meta['item'] = item
                yield request
            self.i += 1

        def parse_post(self, response):
            theitems = []
            pubs = response.xpath("//div/h2[text()='Featured Publications & Products']/following-sibling::ul[1]/li | //div/h2[text()='Publications']/following-sibling::ul[1]/li")
            for i in pubs:
                item = response.meta['item']
                name = response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract_first().strip()
                pubname = i.xpath("a/text()").extract_first().strip()
                pubauth = i.xpath("text()").extract_first().strip()
                pubURL = i.xpath("a/@href").extract_first().strip()
                #RStation = response.xpath("//div[@id='right-float']/div/div/ul/li/a/text()").extract_first().strip()

                item['link'] = response.url
                item['name'] = name
                item['pubname'] = pubname
                item['pubauth'] = pubauth
                item['pubURL'] = pubURL
                #item['RStation'] = RStation

                theitems.append(item)
            return theitems
    

1 answer:

Answer 0: (score: 0)

Create a new item instance for each iteration. `response.meta['item']` returns the same object every time, so each pass through the loop mutates that one item, and `theitems` ends up holding many references to a single object carrying the last iteration's values.

def parse_post(self, response):
    [...]
    for i in pubs:
        item = response.meta['item']
        item = item.copy()
        [...]
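The difference can be sketched with plain dicts standing in for the scrapy item (the function and field names here are illustrative, not from the original spider):

```python
def collect_shared(pubs):
    # mimics the original parse_post: one object reused across the loop
    item = {"RStation": "SRS"}
    results = []
    for pub in pubs:
        item["pubname"] = pub      # mutates the same dict every time
        results.append(item)       # appends another reference, not a copy
    return results

def collect_copied(pubs):
    # mimics the fix: copy the base item before filling it in
    base = {"RStation": "SRS"}
    results = []
    for pub in pubs:
        item = base.copy()         # fresh object per publication
        item["pubname"] = pub
        results.append(item)
    return results

shared = collect_shared(["pub1", "pub2", "pub3"])
copied = collect_copied(["pub1", "pub2", "pub3"])
print([d["pubname"] for d in shared])  # ['pub3', 'pub3', 'pub3']
print([d["pubname"] for d in copied])  # ['pub1', 'pub2', 'pub3']
```

The shared version reproduces the symptom from the question: every row in the csv shows the last publication, repeated once per list entry.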