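I have a list of keywords stored in a CSV file, and I need to search for each of them on several domains and scrape information.
I am trying to make my spider visit the domains in this order: {example1.com, example2.com, example3.com, example4.com}.
The spider should move on to the next domain only if it finds no match on the previous one. As soon as a keyword is matched on any of these domains, the next keyword should be picked from my CSV and the search should start again from example1.com.
Importantly, the keyword picked from the CSV must also be stored in one of the item fields.
My code so far: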
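To make the intended flow concrete, here is a minimal sketch in plain Python rather than Scrapy. The `search_in_order` function and the `fake_sites` lookup tables are hypothetical stand-ins for the real per-domain searches; they only illustrate the "stop at the first match, otherwise fall through" rule and the keyword being stored in `Product_Name`.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for "search one domain": returns a result or None.
Searcher = Callable[[str], Optional[str]]


def search_in_order(keyword: str, searchers: Dict[str, Searcher]) -> Dict[str, str]:
    """Try each domain in order and stop at the first one with a match."""
    row = {"Product_Name": keyword}  # the keyword is stored in an item field
    for domain, search in searchers.items():  # dicts keep insertion order
        found = search(keyword)
        if found is not None:
            row[domain] = found
            break  # match found: the caller moves on to the next keyword
        row[domain] = f"No Image Found at {domain}"
    return row


# Toy lookup tables standing in for the real sites (assumed data).
fake_sites = {
    "Example1": {"Keyword1": "Example1Found"}.get,
    "Example2": {"Keyword3": "Example2Found"}.get,
    "Example3": {"Keyword2": "Example3Found"}.get,
}

rows = [search_in_order(k, fake_sites) for k in ["Keyword1", "Keyword2", "Keyword3"]]
for row in rows:
    print(row)
```

In a real spider the searches are asynchronous, so the same decision would presumably have to be made inside each callback instead of in a plain loop like this one.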
    import csv

    from scrapy import Request
    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider

    # ExampleItem comes from the project's items module (import not shown)
    item = ExampleItem()

    # Read the keywords from the second column of the input CSV
    f = open("InputKeywords.csv")
    csv_file = csv.reader(f)
    productname_list = []
    for row in csv_file:
        productname_list.append(row[1])

    class MySpider(CrawlSpider):
        name = "test1"
        allowed_domains = ["example1.com", "example2.com", "example3.com"]

        def start_requests(self):
            for keyword in productname_list:
                item["Product_Name"] = keyword  # loading the searched keyword in my output
                request = Request("http://www.example1.com/search?noOfResults=20&keyword=" + str(keyword), self.Example1)
                yield request
                if item["Example1"] == "No Image Found at Example1":
                    request = Request("http://www.Example2.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=" + str(keyword), self.Example2)
                    yield request
                    if item["Example2"] == "No Image Found at Example2":
                        request = Request("http://www.Example3.com/search?noOfResults=20&keyword=" + str(keyword), self.Example3)
                        yield request

        def Example1(self, response):
            sel = Selector(response)
            result = response.xpath("//div[@class='hoverProductImage product-image '][1]/a/@href")  # Checking if the search term exists on the domain
            if result:
                request = Request(result.extract()[0], callback=self.Example1Items)  # For parsing information if the search keyword is found
                request.meta["item"] = item
                return request
            else:
                item["Example1"] = "No Image Found at Example1"
                return item

        def Example1Items(self, response):
            sel = Selector(response)
            item = response.meta['item']
            item["Example1"] = sel.xpath("//meta[@name='og_image']/@content").extract()
            return item

        def Example2(self, response):
            sel = Selector(response)
            result = response.xpath("//div[@class='a-row a-spacing-small'][1]/a/@href")
            if result:
                request = Request(result.extract()[0], callback=self.Example2Items)
                request.meta["item"] = item
                return request
            else:
                item["Example2"] = "No Image Found at Example2"
                return item

        def Example2Items(self, response):
            sel = Selector(response)
            item = response.meta['item']
            item["Example2"] = sel.xpath("//div[@class='ap_content']/div/div/div/img/@src").extract()
            return item
----CODE FOR EXAMPLE 3 and EXAMPLE 3 Items----
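My code is far from correct, but the first problem I am facing is that the keywords are not stored in the same order as in the input CSV. I also have not been able to implement the logic of running the Example2 or Example3 search only when the previous search found nothing.
Any help would be appreciated.
Basically, I need my output, which I will store in a CSV, to look like this: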
{
"Keyword1", "Example1Found","","",
"Keyword2", "No Image Found at Example1","No Image Found at Example2","Example3Found",
"Keyword3", "No Image Found at Example1","Example2Found","",
}
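For reference, the target file could be produced roughly as follows. This is only a sketch under assumptions: each result is tagged with the keyword's position in the input CSV (a hypothetical `input_index`), because Scrapy handles requests asynchronously and items may not come back in input order. The `write_rows` helper and the `FIELDS` column list are names made up for this illustration. `csv.DictWriter` with `restval=""` leaves the columns blank for domains that were never searched.

```python
import csv
import io

# Column order of the desired output (assumed from the sample above).
FIELDS = ["Product_Name", "Example1", "Example2", "Example3"]


def write_rows(results, out):
    """Write (input_index, row) pairs to `out`, sorted by input order."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="")
    # Sort on the index only, so the row dicts are never compared.
    for _, row in sorted(results, key=lambda pair: pair[0]):
        writer.writerow(row)


# Results arriving out of order, as they might from an asynchronous crawl.
results = [
    (2, {"Product_Name": "Keyword3",
         "Example1": "No Image Found at Example1",
         "Example2": "Example2Found"}),
    (0, {"Product_Name": "Keyword1", "Example1": "Example1Found"}),
    (1, {"Product_Name": "Keyword2",
         "Example1": "No Image Found at Example1",
         "Example2": "No Image Found at Example2",
         "Example3": "Example3Found"}),
]

buffer = io.StringIO()  # stands in for the real output file
write_rows(results, buffer)
output_text = buffer.getvalue()
print(output_text)
```

Sorting requires holding all rows until the crawl finishes, which seems acceptable for a keyword list of modest size.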