Slow data loading from a pandas DataFrame into MongoDB

Date: 2018-12-17 08:53:51

Tags: python python-3.x mongodb pandas performance

I have 2 large CSV files that need to be loaded into Mongo collections. First I read the data into a pandas DataFrame, do some preprocessing, and then insert the resulting dicts into a Mongo collection. The problem is that performance is slow, because the inserts run sequentially, and the data can only be loaded into the second collection after the first collection has been fully populated (its rows are then updated with foreign keys). How can I speed up the loading process?


1 answer:

Answer 0: (score: 0)

TL;DR

Insert in bulk instead of inserting one record at a time.

insert_many

Currently you have:

def insert_to_collection(collection: pymongo.collection.Collection, data: dict):
    collection.insert(data)

By the way, you are using the deprecated insert() there.

What you want instead is:

def insert_to_collection(collection: pymongo.collection.Collection, data: list):
    collection.insert_many(data)
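One detail worth knowing: insert_many() also takes an ordered parameter. With ordered=False the server may write the documents in any order and keeps inserting past individual failures (e.g. duplicate _id values), which usually helps throughput on bulk loads. The StubCollection below is only a stand-in so the sketch runs without a live Mongo server:

```python
# StubCollection is a stand-in for pymongo.collection.Collection,
# so this sketch runs without a live MongoDB server
class StubCollection:
    def __init__(self):
        self.docs = []

    def insert_many(self, documents, ordered=True):
        # pymongo's real insert_many has the same (documents, ordered) shape;
        # ordered=False lets the server write documents in any order and
        # continue past individual failures instead of stopping at the first
        self.docs.extend(documents)

def insert_to_collection(collection, data: list):
    collection.insert_many(data, ordered=False)

coll = StubCollection()
insert_to_collection(coll, [{"_id": "tt001"}, {"_id": "tt002"}])
```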

So, in your two functions fill_movie_data and fill_actors_data, instead of calling insert_to_collection() on every iteration of the loop, collect the records and insert them in bulk every so often.
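The buffering pattern itself is independent of MongoDB. A minimal sketch, where the flush callback stands in for insert_many():

```python
def batched_flush(records, batch_size, flush):
    """Accumulate records and call flush(batch) each time the batch fills up."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    # don't forget the final, partially filled batch
    if batch:
        flush(batch)

batches = []
batched_flush(range(10), 4, batches.append)
# batches is now [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```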

Code

Below is the code you posted, with a few modifications:

Add a max_bulk_size. The bigger the better, as long as you make sure it doesn't exceed your RAM.

max_bulk_size = 500

Add a results_list and append each result_dict to it. Once the list size reaches max_bulk_size, insert it and empty the list.

def fill_movie_data():
    '''
    iterates over movie Dataframe
    process values and creates dict structure
    with specific attributes to insert into MongoDB movie collection
    '''


    # load data to pandas Dataframe
    logger.info("Reading movie data to Dataframe")
    data = read_data('datasets/title.basics.tsv')

    results_list = []

    for index, row in data.iterrows():
        result_dict = {}

        id_ = row['tconst']
        title = row['primaryTitle']

        # check value of movie year (if not NaN)
        if not pd.isnull(row['endYear']) and not pd.isnull(row['startYear']):
            year = list([row['startYear'], row['endYear']])
        elif not pd.isnull(row['startYear']):
            year = int(row['startYear'])
        else:
            year = None

        # check value of movie duration (if not NaN)
        if not pd.isnull(row['runtimeMinutes']):
            try:
                duration = int(row['runtimeMinutes'])
            except ValueError:
                duration = None
        else:
            duration = None

        # check value of genres (if not NaN)
        if not pd.isnull(row['genres']):
            genres = row['genres'].split(',')
        else:
            genres = None

        result_dict['_id'] = id_
        result_dict['primary_title'] = title

        # if both years have values
        if isinstance(year, list):
            result_dict['year_start'] = int(year[0])
            result_dict['year_end'] = int(year[1])

        # if start_year has value
        elif year:
            result_dict['year'] = year

        if duration:
            result_dict['duration'] = duration

        if genres:
            result_dict['genres'] = genres

        results_list.append(result_dict)

        if len(results_list) >= max_bulk_size:
            insert_to_collection(movie_collection, results_list)
            results_list = []

    # insert whatever is left over once the loop finishes
    if results_list:
        insert_to_collection(movie_collection, results_list)

Do the same for the other loop.

def fill_actors_data():
    '''
    iterates over actors Dataframe
    process values, creates dict structure
    with new fields to insert into MongoDB actors collection
    '''


    # load data to pandas Dataframe
    logger.info("Reading actors data to Dataframe")
    data = read_data('datasets/name.basics.tsv')

    logger.info("Inserting data to actors collection")

    results_list = []

    for index, row in data.iterrows():
        result_dict = {}

        id_ = row['nconst']
        name = row['primaryName']

        # if no birth year value
        if pd.isnull(row['birthYear']):
            yob = None
            age = None
            alive = False
        # if both birth and death year have value
        elif not pd.isnull(row['birthYear']) and not pd.isnull(row['deathYear']):
            yob = int(row['birthYear'])
            death = int(row['deathYear'])
            age = death - yob
            alive = False
        # if only birth year has value
        else:
            yob = int(row['birthYear'])
            current_year = datetime.now().year
            age = current_year - yob
            alive = True

        if not pd.isnull(row['knownForTitles']):
            movies = row['knownForTitles'].split(',')
        else:
            movies = []

        result_dict['_id'] = id_
        result_dict['name'] = name
        result_dict['yob'] = yob
        result_dict['alive'] = alive
        result_dict['age'] = age
        result_dict['movies'] = movies

        results_list.append(result_dict)

        if len(results_list) >= max_bulk_size:
            insert_to_collection(actors_collection, results_list)
            results_list = []

        # update movie documents with the list of actor ids
        movie_collection.update_many({"_id": {"$in": movies}}, {"$push": {"people": id_}})

    # insert whatever is left over once the loop finishes
    if results_list:
        insert_to_collection(actors_collection, results_list)
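For reference, the $in / $push update above attaches each actor id to every movie document whose _id appears in that actor's knownForTitles list. Its effect, mimicked on plain dicts (a sketch of the semantics only, not pymongo):

```python
def push_actor(movie_docs, movie_ids, actor_id):
    # mimics: update_many({"_id": {"$in": movie_ids}}, {"$push": {"people": actor_id}})
    for doc in movie_docs:
        if doc["_id"] in movie_ids:
            doc.setdefault("people", []).append(actor_id)

movies = [{"_id": "tt1"}, {"_id": "tt2"}, {"_id": "tt3"}]
push_actor(movies, ["tt1", "tt3"], "nm42")
# movies[0] and movies[2] gain "people": ["nm42"]; movies[1] is untouched
```

Note that each row's update depends on that row's movies and id_, so these updates stay inside the loop even after the inserts are batched.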