How can I run pd.merge without getting a MemoryError?

Time: 2019-06-14 15:24:40

Tags: python pandas

Every time I try to merge large DataFrames I get a MemoryError because they are too big. Is there a way to do the merge incrementally?

I am using the AirBnB data from the following sources:

https://github.com/FraPochetti/Airbnb/blob/master/data/train_users_2.csv
https://github.com/FraPochetti/Airbnb/blob/master/data/test_users.csv
https://github.com/jafriyie1/Airbnb-New-User-Bookings/blob/master/sessions.csv.zip

import pandas as pd
import numpy as np
train_users = pd.read_csv("train_users_2.csv")
test_users = pd.read_csv("test_users.csv")
df = pd.concat((train_users, test_users), axis = 0, ignore_index = True, sort = True)
df_without_NDF = df[df['country_destination']!='NDF']
sessions = pd.read_csv("sessions.csv")
session_booked = pd.merge(df_without_NDF, sessions, how = 'left', left_on = 'id', right_on = 'user_id')
sessions.rename(columns = {'user_id': 'id'}, inplace=True)
action_count = sessions.groupby(['id', 'action'])['secs_elapsed'].agg(len).unstack()
action_type_count = sessions.groupby(['id', 'action_type'])['secs_elapsed'].agg(len).unstack()
action_detail_count = sessions.groupby(['id', 'action_detail'])['secs_elapsed'].agg(len).unstack()
device_type_sum = sessions.groupby(['id', 'device_type'])['secs_elapsed'].agg(sum).unstack()
sessions_data = pd.concat([action_count, action_type_count, action_detail_count, device_type_sum],axis=1)
sessions_data.columns = sessions_data.columns.map(lambda x: str(x) + '_count')
sessions_data['most_used_device'] = sessions.groupby('id')['device_type'].max()
secs_elapsed = sessions.groupby('id')['secs_elapsed']
secs_elapsed = secs_elapsed.agg(
    # Named aggregation (pandas >= 0.25): output_column=aggregation.
    secs_elapsed_sum='sum',
    secs_elapsed_mean='mean',
    secs_elapsed_min='min',
    secs_elapsed_max='max',
    secs_elapsed_median='median',
    secs_elapsed_std='std',
    secs_elapsed_var='var',
    # Pause counts; the thresholds are in seconds.
    day_pauses=lambda x: (x > 86400).sum(),
    long_pauses=lambda x: (x > 300000).sum(),
    short_pauses=lambda x: (x < 3600).sum(),
    session_length=np.count_nonzero,
)
secs_elapsed.reset_index(inplace=True)
sessions_data.index.name = 'id'
sessions_secs_elapsed = pd.merge(sessions_data, secs_elapsed, on='id', how='left')
df = pd.merge(df, sessions_secs_elapsed, on='id', how = 'left')
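Before the final merge it can help to check how much memory the wide feature table occupies, since the unstacked counts are stored as float64. The sketch below runs on made-up data and is not part of the original post: `downcast_floats` and `megabytes` are hypothetical helpers, and it assumes the column names are unique and that float32 precision is acceptable (float32 represents integers exactly only up to 2**24, so large sums may be rounded).

```python
import numpy as np
import pandas as pd


def downcast_floats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every float64 column is stored as float32."""
    # Assumption: column names are unique, so they can be used as dict keys.
    float_cols = frame.select_dtypes(include="float64").columns
    return frame.astype({col: "float32" for col in float_cols})


def megabytes(frame: pd.DataFrame) -> float:
    """Approximate in-memory size of a frame in MiB."""
    return frame.memory_usage(deep=True).sum() / 1024 ** 2


# Toy stand-in for a wide per-user feature table (hypothetical data).
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.random((1000, 50)), columns=[f"f{i}" for i in range(50)])
wide.insert(0, "id", [f"u{i}" for i in range(1000)])

slim = downcast_floats(wide)
print(f"float64: {megabytes(wide):.2f} MiB, float32: {megabytes(slim):.2f} MiB")
```

Halving the width of the float columns roughly halves the size of the merged result as well, which may or may not be enough on its own.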

Error:

MemoryError                               Traceback (most recent call last)
<ipython-input-3-050db858f400> in <module>()
     34 sessions_data.index.name = 'id'
     35 sessions_secs_elapsed = pd.merge(sessions_data, secs_elapsed, on='id', how='left')
---> 36 df = pd.merge(df, sessions_secs_elapsed, on='id', how = 'left')
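One way to do a left merge incrementally is to split the left frame into row slices and merge each slice separately, so that only one slice of the result is built at a time. The following is a minimal sketch on toy data, not a tested fix for the data above: `merge_in_chunks` and `chunk_size` are hypothetical names, and it assumes the right-hand table alone still fits in memory.

```python
from typing import Iterator

import pandas as pd


def merge_in_chunks(
    left: pd.DataFrame, right: pd.DataFrame, on: str, chunk_size: int = 50_000
) -> Iterator[pd.DataFrame]:
    """Yield the left join of consecutive row slices of `left` with `right`."""
    for start in range(0, len(left), chunk_size):
        piece = left.iloc[start:start + chunk_size]
        yield piece.merge(right, on=on, how="left")


# Toy data: ten users, features available only for every second user.
left = pd.DataFrame({"id": [f"u{i}" for i in range(10)], "age": range(20, 30)})
right = pd.DataFrame(
    {"id": [f"u{i}" for i in range(0, 10, 2)], "secs": [1.5, 2.5, 3.5, 4.5, 5.5]}
)

# Merge four rows at a time, then stitch the pieces back together.
combined = pd.concat(merge_in_chunks(left, right, on="id", chunk_size=4), ignore_index=True)

# The chunked result matches a single full left merge.
expected = left.merge(right, on="id", how="left")
pd.testing.assert_frame_equal(combined, expected)
```

Concatenating all pieces still requires the full result to fit in memory, so for data of this size it may be more useful to write each piece to disk as it is produced, for example with `to_csv(path, mode="a")`, rather than calling `pd.concat`. With the frames from the question the call would presumably be `merge_in_chunks(df, sessions_secs_elapsed, on="id")`.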

0 answers:

No answers yet