I have a Python script that queries a database and fetches a result set. I then run a series of steps on that result set using a set of bash scripts invoked via subprocess.check_call(). So currently, this is my Python script:
...
...
def get_ids_and_process():
    bqc = bigquery.Client.from_service_account_json(gcloud_account_key)
    query = (
        'SELECT ProjectID, ProjectName, UserEmail from `table_name`')
    query_job = bqc.query(query)
    data = query_job.result()
    data_rows = list(data)
    if len(data_rows) == 0:
        sys.exit()
    else:
        for row in data_rows:
            subprocess.check_call(
                [scripts_dir + "create_project.sh", str(row['ProjectID']), str(row['ProjectName']), folder_id])
            subprocess.check_call([scripts_dir + "link_billing.sh", str(row['ProjectID']), account_id])
            subprocess.check_call([scripts_dir + "add_iam_policy.sh", str(row['ProjectID']), str(row['UserEmail'])])
            subprocess.check_call([scripts_dir + "create_datasets.sh", str(row['ProjectID'])])
            subprocess.check_call([scripts_dir + "create_tables.sh", str(row['ProjectID'])])
...
...
Since this is inherently iterative, running all these scripts takes a while, so I thought of using joblib and pool.map() to parallelize the loop. However, neither worked: I hit a "maximum recursion depth exceeded!" error with joblib, and pool.map() failed with a similar error.
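For reference, the standard multiprocessing.Pool pattern requires the worker to be a picklable, module-level function, and each argument passed to it must be picklable too. This is a minimal sketch of that pattern, with a hypothetical process_row worker standing in for the chain of subprocess.check_call() steps (not the actual scripts above):

```python
import multiprocessing

def process_row(row):
    # Hypothetical worker: stands in for the per-project subprocess.check_call()
    # chain. It must live at module level so the pool can pickle a reference to it,
    # and `row` must be plain picklable data (tuples/dicts), not a client object.
    project_id, project_name, user_email = row
    return project_id  # placeholder for the real per-project work

if __name__ == '__main__':
    rows = [('p1', 'Project One', 'a@example.com'),
            ('p2', 'Project Two', 'b@example.com')]
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(process_row, rows)
    print(results)  # pool.map preserves input order
```

Recursion-depth errors in this setup usually point to something unpicklable (or self-referential) being captured by the function or its arguments, which is worth ruling out before abandoning the approach.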
In a few of my bash scripts, I read a file and, for each value in the file, run a command along with the arguments passed in from check_call(). I parallelized that loop with & and it works perfectly. So I figured that, since joblib didn't work out, I could pass the result set to the bash scripts as lists and process each element of the arrays in parallel there. Something like this:
projectids = []
projectnames = []
emails = []

def process_ids():
    subprocess.check_call(
        [scripts_dir + "create_project_from_array.sh", projectids, projectnames])
    subprocess.check_call([scripts_dir + "link_billing_from_array.sh", projectids])

def get_ids():
    bqc = bigquery.Client.from_service_account_json(gcloud_account_key)
    query = (
        'SELECT ProjectID, ProjectName, UserEmail from `table_name`')
    query_job = bqc.query(query)
    data = query_job.result()
    data_rows = list(data)
    if len(data_rows) == 0:
        sys.exit()
    else:
        for row in data_rows:
            projectids.append(row['ProjectID'])
            projectnames.append(row['ProjectName'])
            emails.append(row['UserEmail'])
        process_ids()

get_ids()
Then, in my bash scripts, I read the arrays and, for each pair of elements, run the required command in parallel using &. However, apparently, you can't pass a list to subprocess.check_call(). What are my options here? Is it possible to do what I'm trying to do?
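To illustrate the shape of the problem: every element of the argument list given to check_call() must be a single string, so a Python list has to be flattened into separate argv entries (which bash sees as "$@") or serialized into one delimited string (which bash can split back apart with IFS). A sketch of both options, using the script name from above with the path prefix omitted:

```python
projectids = ['proj-a', 'proj-b', 'proj-c']

# Option 1: expand the list into individual positional arguments.
# The bash script receives them as "$1" "$2" ... and can loop over "$@".
args = ['./create_project_from_array.sh'] + [str(p) for p in projectids]

# Option 2: join into one delimited string; the bash side splits it back,
# e.g. with: IFS=',' read -r -a arr <<< "$1"
joined = ','.join(str(p) for p in projectids)

print(args)    # ['./create_project_from_array.sh', 'proj-a', 'proj-b', 'proj-c']
print(joined)  # proj-a,proj-b,proj-c
```

Either vector could then be handed to subprocess.check_call() as usual, since each element is now a plain string.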