I have two large CSV files that need to be loaded into Mongo collections. First I read the data into a pandas DataFrame, do some preprocessing, and then insert the resulting dicts into a Mongo collection. The problem is that performance is slow, because this happens sequentially, and the data can only be loaded into the second collection after the first collection has been filled (its rows are updated with foreign keys). How can I speed up the loading process?
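To make the bottleneck concrete, here is a minimal sketch of the row-by-row pattern described above; the client URI, database and collection names, and the preprocessing are placeholders, not the actual code:

import pandas as pd
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
movie_collection = client["imdb"]["movies"]                # placeholder names

def load_movies_row_by_row(path: str):
    data = pd.read_csv(path, sep='\t')   # read the whole TSV into a DataFrame
    for _, row in data.iterrows():
        doc = {"_id": row["tconst"], "primary_title": row["primaryTitle"]}
        # one network round trip to MongoDB per row -- this is what makes
        # the sequential load slow
        movie_collection.insert_one(doc)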
Answer 0 (score: 0)
Insert in bulk with insert_many instead of inserting one record at a time.
Currently you have:
def insert_to_collection(collection: pymongo.collection.Collection, data: dict):
    collection.insert(data)
By the way, you are using insert(), which is deprecated.
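For completeness: the non-deprecated way to insert a single document is insert_one(). A tiny illustration, with a made-up document:

collection.insert_one({"_id": "tt0000001", "primary_title": "Example Movie"})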
What you want instead is:
def insert_to_collection(collection: pymongo.collection.Collection, data: list):
    collection.insert_many(data)
So in your two functions, fill_movie_data and fill_actors_data, instead of calling insert_to_collection() on every iteration of the loop, call it only once in a while and insert in bulk.
Below is the code you posted, with a few modifications:
Add a max_bulk_size; the bigger the better, as long as the batch does not exceed your RAM.
max_bulk_size = 500
Add a results_list and append each result_dict to it. Once the list reaches max_bulk_size, insert it and empty the list.
def fill_movie_data():
    '''
    iterates over movie Dataframe
    process values and creates dict structure
    with specific attributes to insert into MongoDB movie collection
    '''
    # load data to pandas Dataframe
    logger.info("Reading movie data to Dataframe")
    data = read_data('datasets/title.basics.tsv')
    results_list = []
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['tconst']
        title = row['primaryTitle']
        # check value of movie year (if not NaN)
        if not pd.isnull(row['endYear']) and not pd.isnull(row['startYear']):
            year = [row['startYear'], row['endYear']]
        elif not pd.isnull(row['startYear']):
            year = int(row['startYear'])
        else:
            year = None
        # check value of movie duration (if not NaN)
        if not pd.isnull(row['runtimeMinutes']):
            try:
                duration = int(row['runtimeMinutes'])
            except ValueError:
                duration = None
        else:
            duration = None
        # check value of genres (if not NaN)
        if not pd.isnull(row['genres']):
            genres = row['genres'].split(',')
        else:
            genres = None
        result_dict['_id'] = id_
        result_dict['primary_title'] = title
        # if both years have values
        if isinstance(year, list):
            result_dict['year_start'] = int(year[0])
            result_dict['year_end'] = int(year[1])
        # if start_year has value
        elif year:
            result_dict['year'] = year
        if duration:
            result_dict['duration'] = duration
        if genres:
            result_dict['genres'] = genres
        results_list.append(result_dict)
        # once the batch is full, insert it in one call and start a new batch
        if len(results_list) > max_bulk_size:
            insert_to_collection(movie_collection, results_list)
            results_list = []
    # insert whatever is left in the final, partially filled batch
    if results_list:
        insert_to_collection(movie_collection, results_list)
Same thing for the other loop.
def fill_actors_data():
    '''
    iterates over actors Dataframe
    process values, creates dict structure
    with new fields to insert into MongoDB actors collection
    '''
    # load data to pandas Dataframe
    logger.info("Reading actors data to Dataframe")
    data = read_data('datasets/name.basics.tsv')
    logger.info("Inserting data to actors collection")
    results_list = []
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['nconst']
        name = row['primaryName']
        # if no birth year value
        if pd.isnull(row['birthYear']):
            yob = None
            age = None
            alive = False
        # if both birth and death year have value
        elif not pd.isnull(row['birthYear']) and not pd.isnull(row['deathYear']):
            yob = int(row['birthYear'])
            death = int(row['deathYear'])
            age = death - yob
            alive = False
        # if only birth year has value
        else:
            yob = int(row['birthYear'])
            current_year = datetime.now().year
            age = current_year - yob
            alive = True
        # check value of known titles (if not NaN)
        if not pd.isnull(row['knownForTitles']):
            movies = row['knownForTitles'].split(',')
        else:
            movies = []
        result_dict['_id'] = id_
        result_dict['name'] = name
        result_dict['yob'] = yob
        result_dict['alive'] = alive
        result_dict['age'] = age
        result_dict['movies'] = movies
        results_list.append(result_dict)
        # once the batch is full, insert it in one call and start a new batch
        if len(results_list) > max_bulk_size:
            insert_to_collection(actors_collection, results_list)
            results_list = []
        # update movie documents with list of actors ids
        if movies:
            movie_collection.update_many({"_id": {"$in": movies}}, {"$push": {"people": id_}})
    # insert whatever is left in the final, partially filled batch
    if results_list:
        insert_to_collection(actors_collection, results_list)
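One more option worth trying, if the order of documents within a batch does not matter: insert_many() accepts ordered=False, which lets the server continue past individual failures (for example duplicate _id values) and is often faster for large batches. A sketch of the bulk helper with that option; the error handling is only illustrative:

from pymongo.errors import BulkWriteError

def insert_to_collection(collection: pymongo.collection.Collection, data: list):
    try:
        # ordered=False: the remaining documents are still inserted
        # even if some of them fail
        collection.insert_many(data, ordered=False)
    except BulkWriteError as err:
        # err.details describes the documents that were rejected
        logger.warning("Some documents were not inserted: %s", err.details)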