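My program is set up to download the Cartesian product of URLs built from the state and other variables, save each zip file (from the generated URL) to a specified location, check the zip file for data (some zip files download with no data), write to a specific file about that state's data, and then write to a file when the state is complete. This is done in parallel by state, i.e. Alabama and Alaska run the steps above at the same time. However, I keep getting the following error: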
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (179, 0))
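The error occurs when I start fresh, i.e. when the process has not been run before. If I run the process partway and then restart it, the error does not occur. More specifically, it happens at random.
Here is my code:
Functions -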
import itertools
import os
import shutil
import urllib

def createURL(state, typ, geography, level, data, dictionary):
    # Every combination of the inputs becomes one URL.
    DATALIST = list(itertools.product(typ, geography, level, data))
    TXTLIST = list(itertools.product(typ, dictionary))
    DEFLIST = list(itertools.product(typ))
    DATALINKS = []
    for data in DATALIST:
        result = 'URL'  # placeholder for the real URL
        DATALINKS.append(result)
    TXTLINKS = []
    for txt in TXTLIST:
        links = 'URL'  # placeholder for the real URL
        TXTLINKS.append(links)
    DEFLINKS = []
    for defl in DEFLIST:
        definitions = 'URL'  # placeholder for the real URL
        DEFLINKS.append(definitions)
    URLLINKS = DATALINKS + TXTLINKS + DEFLINKS
    return URLLINKS

def downloadData(state, TYPE, GEOGRAPHY, LEVEL, DATA,
                 DICTIONARY, YEAR, QUARTER, completedStates):
    print ('Working on state: ', state)
    URLLINKS = createURL(state, TYPE, GEOGRAPHY, LEVEL, DATA, DICTIONARY)
    DIRECTORY = '/home/justin/QWI/' + YEAR + 'Q' + QUARTER + '/' + state
    # DIRECTORY[:-2] strips the two-letter state code, leaving the shared parent.
    if not os.path.exists(DIRECTORY[:-2]):
        os.makedirs(DIRECTORY[:-2])
    if not os.path.exists(DIRECTORY):
        os.makedirs(DIRECTORY)
    # Log of already-downloaded URLs, shared by all states.
    downLoadedURLs = DIRECTORY[:-2] + 'downLoadedURLs.txt'
    if not os.path.isfile(downLoadedURLs):
        with open(downLoadedURLs, 'a') as downloaded:
            downloaded.write('')
    with open(downLoadedURLs) as downloaded:
        URLcontent = downloaded.read().splitlines()
    # Skip anything fetched on a previous run.
    URLLINKS = [x for x in URLLINKS if x not in URLcontent]
    for url in URLLINKS:
        print ('Downloading data: ', url)
        save = DIRECTORY + '/' + os.path.basename(url)
        urllib.urlretrieve(url, save)
        with open(downLoadedURLs, 'a') as downloaded:
            downloaded.write('{}\n'.format(url))
        # An empty file means the state has no data: drop its directory and stop.
        if os.stat(save).st_size == 0:
            shutil.rmtree(DIRECTORY)
            with open(DIRECTORY[:-2] + '/zeroDataStates.txt', 'a') as zeroData:
                zeroData.write('{}\n'.format(state))
            break
    with open(completedStates, 'a') as completedState:
        completedState.write('{}\n'.format(state))
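
One thing I notice is that the error only shows up on a fresh run, and a fresh run is the only time the directories do not exist yet. Since several states run `downloadData` at once, two workers could both pass the `os.path.exists` check and then both call `os.makedirs` on the shared parent directory. I have not confirmed this is the cause. The sketch below shows a race-tolerant way to create a directory; `ensure_dir` is a hypothetical helper name, not something in my current code.

```python
import errno
import os


def ensure_dir(path):
    """Create path (and parents) if missing; tolerate another worker creating it first."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # EEXIST on an existing directory means a parallel worker got there first.
        # Anything else (permissions, a plain file in the way) is re-raised.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

It could stand in for the two `if not os.path.exists(...)` checks, e.g. `ensure_dir(DIRECTORY)`, which also creates the parent.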
Here is the parallel code:
from joblib import Parallel, delayed

STATE = ['al', 'ak', etc...]
Parallel(n_jobs = CORES)(
    delayed(downloadData)(state, TYPE, GEOGRAPHY, LEVEL, DATA, DICTIONARY,
                          YEAR, QUARTER, completedStates)
    for state in STATE)
I think the error occurs either when writing to the files or when fetching the URLs.
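
The message above does not show the original exception, so I cannot tell which of the two it is. A minimal sketch of how the real traceback could be captured per state is below. It assumes it is acceptable for a worker to return an error record instead of raising, and `run_logged` is a hypothetical wrapper name.

```python
import traceback


def run_logged(func, *args):
    """Call func(*args); on failure return the full traceback text instead of raising."""
    try:
        return ('ok', func(*args))
    except Exception:
        # format_exc() renders the active exception as a plain string,
        # which can be returned from the worker and printed by the parent.
        return ('error', traceback.format_exc())
```

With joblib this would be called as `delayed(run_logged)(downloadData, state, ...)` in place of `delayed(downloadData)(state, ...)`, and any `('error', text)` results printed afterwards.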