Scraping domains methodically with Scrapy based on keywords stored in a csv

Date: 2015-02-04 07:21:04

Tags: python web-scraping scrapy

I have a list of keywords stored in a csv, and I need to search for them across different domains and scrape information.

I am trying to instruct my spider to go through the domains in this order: {example1.com, example2.com, example3.com and example4.com}

The spider should only move on to the next domain if no match was found on the previous one. If a match for the keyword is found on any of these domains, the next keyword should be picked from my csv and the search should start over again from example1.com.

It is also important that the specific keyword picked from the csv is stored in one of the item fields.
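To make the intended flow concrete, this is roughly what I am after (plain-Python pseudocode only, not working Scrapy code; search() is just a placeholder for the per-domain request):

# Pseudocode of the intended flow; search() is a placeholder, not a real function.
for keyword in productname_list:                      # keywords read from InputKeywords.csv, in file order
    item = ExampleItem()
    item["Product_Name"] = keyword                    # the picked keyword must be stored in the item
    for domain in ["example1.com", "example2.com", "example3.com", "example4.com"]:
        result = search(domain, keyword)              # stands for the search request on that domain
        if result:
            item[domain] = result                     # e.g. the image URL that was found
            break                                     # match found: move on to the next keyword
        item[domain] = "No Image Found at " + domain  # no match: try the next domain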

My code so far is:

import csv

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector

from myproject.items import ExampleItem  # "myproject" stands for my actual project package

item = ExampleItem()

# Read the keywords (second column) from the input csv.
f = open("InputKeywords.csv")
csv_file = csv.reader(f)
productname_list = []
for row in csv_file:
    productname_list.append(row[1])

class MySpider(CrawlSpider):
    name = "test1"
    allowed_domains = ["example1.com", "example2.com", "example3.com"]

    def start_requests(self):
        for keyword in productname_list:
            item["Product_Name"] = keyword  # store the searched keyword in my output item
            request = Request("http://www.example1.com/search?noOfResults=20&keyword=" + str(keyword), self.Example1)
            yield request
            if item["Example1"] == "No Image Found at Example1":
                request = Request("http://www.example2.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=" + str(keyword), self.Example2)
                yield request
                if item["Example2"] == "No Image Found at Example2":
                    request = Request("http://www.example3.com/search?noOfResults=20&keyword=" + str(keyword), self.Example3)
                    yield request


    def Example1(self, response):
        sel = Selector(response)
        # Check whether the search term exists on this domain.
        result = response.xpath("//div[@class='hoverProductImage product-image '][1]/a/@href")
        if result:
            # Parse the product information if the search keyword was found.
            request = Request(result.extract()[0], callback=self.Example1Items)
            request.meta["item"] = item
            return request
        else:
            item["Example1"] = "No Image Found at Example1"
            return item

    def Example1Items(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item["Example1"] = sel.xpath("//meta[@name='og_image']/@content").extract()
        return item

    def Example2(self, response):
        sel = Selector(response)
        result = response.xpath("//div[@class='a-row a-spacing-small'][1]/a/@href")
        if result:
            request = Request(result.extract()[0], callback=self.Example2Items)
            request.meta["item"] = item
            return request
        else:
            item["Example2"] = "No Image Found at Example2"
            return item

    def Example2Items(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item["Example2"] = sel.xpath("//div[@class='ap_content']/div/div/div/img/@src").extract()
        return item

----CODE FOR EXAMPLE 3 and EXAMPLE 3 Items----

My code is far from correct, but the first problem I am facing is that my keywords are not stored in the same order as in the input csv. I am also unable to implement the logic of running the Example2 or Example3 search based on the not-found condition.

Any help would be greatly appreciated.

Basically I need my output, which I will store in a csv, to look like this:

{
"Keyword1", "Example1Found","","",

"Keyword2", "No Image Found at Example1","No Image Found at Example2","Example3Found",

"Keyword3", "No Image Found at Example1","Example2Found","",
}
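For reference, once each item has Product_Name and the three Example fields filled in, I plan to write the rows along these lines (just a sketch with the standard csv module; the output filename and the scraped_items variable are placeholders, not part of the spider):

import csv

# Sketch only: write one output row per keyword, in the same shape as above.
# "scraped_items" stands for the items collected by the spider; "OutputResults.csv" is a placeholder name.
with open("OutputResults.csv", "w") as out:
    writer = csv.writer(out)
    for item in scraped_items:
        writer.writerow([
            item.get("Product_Name", ""),
            item.get("Example1", ""),
            item.get("Example2", ""),
            item.get("Example3", ""),
        ])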

0 Answers:

No answers