I'm working on a web-scraping program. I'm using a while loop that runs several times; when j reaches a certain number, it sets a boolean to False and exits the loop so the callback below can actually parse the data. The problem is that when the spider is ready to fetch the next URL, it needs to re-enter the loop, but since s is still False it never does. How can I set s back to True?
class MySpider(Spider):
    # Name of Spider
    name = 'splash_spider'

    # getting all the url + ip address + useragent pairs then request them
    def start_requests(self):
        # get the file path of the csv file that contains the pairs from the settings.py
        with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
            # requests is a list of dictionaries like this -> {url: str, ua: str, ip: str}
            requests = process_csv(csv_file)
            j = 1
            s = True
            for i, req in enumerate(requests):
                import pdb; pdb.set_trace()
                while s == True:
                    x = len(requests) - i
                    # Return needed url with set delay of 3 seconds
                    yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                                        # Pair with user agent specified in csv file
                                        headers={"User-Agent": req["ua"]},
                                        # Sets splash_url to whatever the current proxy that goes with current URL is instead of actual splash url
                                        splash_url=req["ip"],
                                        priority=x,
                                        meta={'priority': x}  # <- check here!!
                                        )
                    j = j + 1
                    if j == len(requests):
                        s = False
                        j = 1
Answer 0 (score: 2)
Don't use a boolean flag. Use while True: and exit the loop with break:
def start_requests(self):
    # get the file path of the csv file that contains the pairs from the settings.py
    with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
        # requests is a list of dictionaries like this -> {url: str, ua: str, ip: str}
        requests = process_csv(csv_file)
        j = 1
        for i, req in enumerate(requests):
            import pdb; pdb.set_trace()
            while True:
                x = len(requests) - i
                # Return needed url with set delay of 3 seconds
                yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                                    # Pair with user agent specified in csv file
                                    headers={"User-Agent": req["ua"]},
                                    # Sets splash_url to whatever the current proxy that goes with current URL is instead of actual splash url
                                    splash_url=req["ip"],
                                    priority=x,
                                    meta={'priority': x}  # <- check here!!
                                    )
                j = j + 1
                if j == len(requests):
                    j = 1
                    break
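Distilled down to just the loop control, the change is this pattern (a minimal, self-contained sketch; do_work and LIMIT are hypothetical stand-ins for the yield and len(requests)):

# LIMIT and do_work stand in for len(requests) and the yield above
LIMIT = 5

def do_work(j):
    print(f"iteration {j}")

j = 1
while True:
    do_work(j)
    j = j + 1
    if j == LIMIT:  # runs LIMIT - 1 times, matching the original j logic
        j = 1
        break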
That said, it seems you don't need the while loop or j at all; just use for _ in range(len(requests)):. You should also compute x outside the inner loop, since it doesn't change inside it:
def start_requests(self):
    # get the file path of the csv file that contains the pairs from the settings.py
    with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
        # requests is a list of dictionaries like this -> {url: str, ua: str, ip: str}
        requests = process_csv(csv_file)
        for i, req in enumerate(requests):
            import pdb; pdb.set_trace()
            x = len(requests) - i
            for _ in range(len(requests)):
                # Return needed url with set delay of 3 seconds
                yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                                    # Pair with user agent specified in csv file
                                    headers={"User-Agent": req["ua"]},
                                    # Sets splash_url to whatever the current proxy that goes with current URL is instead of actual splash url
                                    splash_url=req["ip"],
                                    priority=x,
                                    meta={'priority': x}  # <- check here!!
                                    )
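Note that the inner loop still yields the same SplashRequest len(requests) times per URL. If the goal is really just to issue each request once with descending priority, the inner loop could arguably be dropped entirely; a sketch under that assumption:

def start_requests(self):
    # open the csv of url / user-agent / proxy rows from settings.py
    with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
        requests = process_csv(csv_file)
        for i, req in enumerate(requests):
            x = len(requests) - i  # earlier rows get higher priority
            yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                                headers={"User-Agent": req["ua"]},
                                splash_url=req["ip"],
                                priority=x,
                                meta={'priority': x})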
Answer 1 (score: 0)
It looks to me like you can get what you want by simply reassigning s to True at the top of each for iteration, right where the pdb import is:
class MySpider(Spider):
    # Name of Spider
    name = 'splash_spider'

    # getting all the url + ip address + useragent pairs then request them
    def start_requests(self):
        # get the file path of the csv file that contains the pairs from the settings.py
        with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
            # requests is a list of dictionaries like this -> {url: str, ua: str, ip: str}
            requests = process_csv(csv_file)
            j = 1
            for i, req in enumerate(requests):
                s = True  # reset the flag so the while loop is re-entered for every URL
                import pdb; pdb.set_trace()
                while s == True:
                    x = len(requests) - i
                    # Return needed url with set delay of 3 seconds
                    yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                                        # Pair with user agent specified in csv file
                                        headers={"User-Agent": req["ua"]},
                                        # Sets splash_url to whatever the current proxy that goes with current URL is instead of actual splash url
                                        splash_url=req["ip"],
                                        priority=x,
                                        meta={'priority': x}  # <- check here!!
                                        )
                    j = j + 1
                    if j == len(requests):
                        s = False
                        j = 1
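Because s is re-initialized to True at the start of every for iteration, the while loop is entered again for each URL; setting s = False now only ends the inner loop for the current request rather than permanently.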