I am using the Python BigQuery API to list a table's contents and then process the resulting JSON response.
Some other useful information: fetching 100,000 records with the main thread alone takes about 30 seconds. Shouldn't splitting the work across multiple threads take noticeably less time?
Can anyone tell me what is causing this performance lag?
import datetime
import json
import threading
import time

from google.cloud import bigquery

# Assumes default application credentials; the original snippet uses a
# pre-built `client` without showing its construction.
client = bigquery.Client()


def _get_values(val):
    # datetime values are not JSON serializable, so stringify them
    if isinstance(val, datetime.datetime):
        return str(val)
    else:
        return val


def map_schema(row):
    # Build a {field_name: value} dict from a BigQuery Row using its
    # internal field-to-index mapping
    row_dict = {}
    values = row.values()
    field_to_index = row._xxx_field_to_index
    for field, index in field_to_index.iteritems():
        row_dict[str(field)] = _get_values(values[index])
    return row_dict


def write_json(file, row):
    file.write(json.dumps(row))


def _save_rows(table, start_index, max_row, file):
    # Fetch max_row rows starting at start_index and write them out
    rows = client.list_rows(table, max_results=max_row, start_index=start_index)
    for row in rows:
        processedRow = map_schema(row)
        write_json(file, processedRow)


def run():
    threads = []
    dataset_ref = client.dataset('hacker_news', project='bigquery-public-data')
    table_ref = dataset_ref.table('comments')
    table = client.get_table(table_ref)  # API call

    start = time.time()
    output_file = open("temp_t.json", "a")
    total_rows = 100000
    total_threads = 10
    max_row = total_rows / total_threads

    # 10 threads takes ~ 20 seconds
    # 5 threads takes the same ~ 20 seconds
    files = []
    for index in range(0, total_rows, max_row):
        file_name = "%s.json" % index
        files.append(open(file_name, "a"))
        # Note: every thread writes to the shared output_file; the
        # per-index files opened above are never written to
        threads.append(threading.Thread(target=_save_rows, args=(table, index, max_row, output_file)))

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    for file in files:
        file.close()

    # takes ~ 30 seconds
    # _save_rows(table, 0, 100000, output_file)

    # takes ~ 4 seconds
    # _save_rows(table, 0, 10000, output_file)

    output_file.close()
    print "total time = %f" % (time.time() - start)


run()
Answer 0 (score: 0)
No, you should not expect to see any improvement from multithreading in Python here. As many have mentioned, this is due to the behavior of the GIL. Since processing the query data is a CPU-heavy task, multithreading can actually make things worse, because it is really only useful for I/O-heavy work.
However, multiprocessing in Python handles CPU-bound tasks much better, so I would try that instead. The reason is that multiprocessing gives true parallelism, whereas multithreading only gives the illusion of parallelism (only one thread runs at a time, so it is concurrency, not parallelism).
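In case it helps, here is a minimal sketch of that multiprocessing idea, with a few assumptions that go beyond the code in the question: each worker process builds its own bigquery.Client (clients should not be shared across processes), each worker writes its chunk to its own file named after the start index, and the names TOTAL_ROWS, ROWS_PER_WORKER, fetch_chunk and _to_serializable are introduced here purely for illustration.

import datetime
import json
import multiprocessing

from google.cloud import bigquery

TOTAL_ROWS = 100000
ROWS_PER_WORKER = 10000  # 10 workers x 10,000 rows, matching the question


def _to_serializable(val):
    # datetime values are not JSON serializable, so stringify them
    if isinstance(val, datetime.datetime):
        return str(val)
    return val


def fetch_chunk(start_index):
    # Each worker process creates its own client and its own output file,
    # so nothing has to be shared or locked between processes.
    client = bigquery.Client()
    table_ref = client.dataset('hacker_news', project='bigquery-public-data').table('comments')
    table = client.get_table(table_ref)
    rows = client.list_rows(table, max_results=ROWS_PER_WORKER, start_index=start_index)
    field_names = [field.name for field in table.schema]
    out_path = "%d.json" % start_index
    with open(out_path, "w") as out:
        for row in rows:
            record = dict(zip(field_names, [_to_serializable(v) for v in row.values()]))
            out.write(json.dumps(record) + "\n")
    return out_path


if __name__ == "__main__":
    starts = list(range(0, TOTAL_ROWS, ROWS_PER_WORKER))
    pool = multiprocessing.Pool(processes=len(starts))
    chunk_files = pool.map(fetch_chunk, starts)
    pool.close()
    pool.join()

The per-worker client and per-worker output file are deliberate: they avoid sharing any unpicklable state between processes, and the row-to-dict and JSON serialization work (the CPU-bound part) then runs on separate cores instead of fighting over the GIL.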