I have the following code to parse a website:
import json
import os
import urllib2
from bs4 import BeautifulSoup

# data_content_file and count_file are path strings defined earlier in the script.

# Load previously scraped data if present; fall back to an empty dict.
if os.path.isfile(data_content_file):
    try:
        with open(data_content_file) as data_file:
            question_answer = json.load(data_file)
    except Exception:
        question_answer = {}
else:
    question_answer = {}

# Resume from the last saved counter, or start from 1.
if os.path.isfile(count_file):
    f = open(count_file, 'r')
    try:
        start = int(f.read())
    except Exception:
        start = 1
    f.close()
else:
    start = 1

f = open(count_file, 'w+')
for x in xrange(start, 500000):
    try:
        print(x)
        # Overwrite the counter file with the current id so a restart can resume.
        f.seek(0)
        f.truncate()
        f.write(str(x))
        req = urllib2.Request("https://islamqa.info/en/" + str(x),
                              headers={'User-Agent': "Magic Browser"})
        con = urllib2.urlopen(req)
        soup = BeautifulSoup(con.read(), "lxml")
        # ... parsing of soup and the matching except clause continue below ...
I don't know why it freezes on certain values of x. If I stop the script and run it again with the same value of x, it works fine. I tried using a timeout, but then it did not load any pages at all, even with the timeout set to 10000:
req = urllib2.Request("https://islamqa.info/en/"+str(x), headers={'User-Agent' : "Magic Browser"},timeout=10000)
What is the best way to avoid this, or to keep the loop going even when the site hangs?
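One pattern that should help, sketched under the assumption that the freezes are stalled connections rather than server errors: pass a timeout to urlopen() and catch socket.timeout and urllib2.URLError, so a stuck id is retried a few times and then skipped. fetch_page below is a hypothetical helper name introduced for illustration:

import socket
import time
import urllib2

def fetch_page(url, retries=3, timeout=10):
    # Hypothetical helper: return the page body, or None if this id
    # should be skipped so the main loop can continue.
    for _ in xrange(retries):
        try:
            req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
            return urllib2.urlopen(req, timeout=timeout).read()
        except urllib2.HTTPError:
            return None    # e.g. 404 for a missing id: no point retrying
        except (socket.timeout, urllib2.URLError):
            time.sleep(2)  # stalled or refused connection: retry shortly
    return None            # still failing after all retries: skip this id

Inside the loop this would replace the Request/urlopen pair:

html = fetch_page("https://islamqa.info/en/" + str(x))
if html is None:
    continue
soup = BeautifulSoup(html, "lxml")

Catching urllib2.HTTPError before urllib2.URLError matters because HTTPError is a subclass of URLError; a missing page returns immediately instead of being retried three times.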