This is my spider file, amzspider.py:
import sys
from scrapy.http import Request
import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class amazonScraperSpider(BaseSpider):
    name = "Amazon_Scraper"
    allowed_domains = ["amazon.com"]
    urls=[]

    def __init__(self,url,product_file,asin_file):
        self.product_file=product_file
        self.asin_file=asin_file
        self.url=[url]
        self.start_urls = [url]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        Tops = hxs.select("//*[@class='zg_more_link']/@href").extract()
        # self.url is a one-element list, so extend() adds the start URL too
        # (the original assigned to Tops.append, which raises AttributeError)
        Tops.extend(self.url)
        for Top in Tops:
            yield Request(Top, callback = self.parseTopsPages)

    def parseTopsPages(self, response):
        hxs = HtmlXPathSelector(response)
        PageLinks = hxs.select("//div[@id='zg_paginationWrapper']//li/a/@href").extract()
        for PageLink in PageLinks:
            yield Request(PageLink, callback = self.parseProducts)

    def parseProducts(self, response):
        hxs = HtmlXPathSelector(response)
        products = hxs.select("//div[@class='zg_itemWrapper']//div[@class='zg_title']/a/@href").extract()
        for productlink in products:
            x = productlink.strip(' \t\n\r')
            x1 = '/'.join(x.split('/')[:6])
            self.urls.append(x1)
        self.save()

    def save(self):
        f=open(self.product_file,"w")
        f1=open(self.asin_file,"w")
        for url in self.urls:
            f.write(url+"\n")
            f.flush()
        for url in self.urls:
            f.write(url.replace("http://www.","")+"\n")
            f.flush()
        for url in self.urls:
            f.write("http://www.amazon.com/gp/product/" + url.split("/")[-1]+"\n")
            f.flush()
        for url in self.urls:
            f.write("amazon.com/gp/product/" + url.split("/")[-1]+"\n")
            f.flush()
        f.close()
        for url in self.urls:
            f1.write(url.split("/")[-1]+"\n")
            f1.flush()
        f1.close()
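I call it from controller.py, and I want to wait for it to finish (block the thread) and only continue with the rest of controller.py once the scraping work is done.
I call it like this: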
spider = amzspider.amazonScraperSpider(url, settings['product_file'], settings['asins_file'])
The problem: controller.py keeps executing, and the amzspider.py call does not block the thread.
Answer 0 (score: 0)
Your main() function only creates an instance; it doesn't actually make that instance do anything. You should call:
spider = amzspider.amazonScraperSpider(url, settings['product_file'], settings['asins_file'])
in controller.py. That gives you direct access to the instance, so you don't need main(). You can then use the instance:
response = get_a_response() # whatever you do here
spider.parse(response) # give the spider work to do
and so on.
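To illustrate the point that constructing an object does no work by itself, here is a minimal sketch with no Scrapy involved. FakeSpider and its seen attribute are hypothetical stand-ins, not part of the original code. It also shows a detail that applies to the spider above: parse() uses yield, so calling it only returns a generator, and the body runs once something iterates over the result.

```python
class FakeSpider:
    """Hypothetical stand-in for amazonScraperSpider (assumption: no Scrapy)."""

    def __init__(self, url):
        # Construction only stores state; no crawling happens here.
        self.start_urls = [url]
        self.seen = []

    def parse(self, response):
        # Generator method: the body does not run until it is iterated.
        for link in response:
            self.seen.append(link)
            yield link


spider = FakeSpider("http://example.com")
pending = spider.parse(["a", "b"])  # returns a generator, nothing executed yet
seen_before = list(spider.seen)     # snapshot taken before iteration
links = list(pending)               # iterating is what actually runs parse()
```

In a real Scrapy run the framework's engine does this iteration and schedules the yielded requests, which is why calling the spider's methods by hand is usually not enough to perform a full crawl.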
Answer 1 (score: 0)
Without seeing the code that creates the thread, it looks like you want to call join() on the thread that runs amzspider.main.
Other threads can call a thread's join() method. This blocks the calling thread until the thread whose join() method was called terminates.
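As a rough sketch of that join() behaviour using only the standard library: run_spider below is a hypothetical placeholder for whatever amzspider.main does, since the thread-creation code is not shown in the question.

```python
import threading
import time


def run_spider(results):
    # Hypothetical stand-in for the scraping work (assumption: not real Scrapy).
    time.sleep(0.2)
    results.append("done scraping")


results = []
worker = threading.Thread(target=run_spider, args=(results,))
worker.start()

# join() blocks the calling (controller) thread until run_spider returns.
worker.join()

# Code placed here only runs after the worker thread has finished.
print(results)
```

join() also accepts an optional timeout in seconds; if you pass one, check worker.is_alive() afterwards to find out whether the thread actually finished.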