Fill NaN values from another DataFrame (with a different shape)

Date: 2018-03-02 03:02:08

Tags: python pandas dataframe

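I am looking for a faster way to improve the performance of my solution to the following problem: a DataFrame has two columns that contain some NaN values. The challenge is to replace these NaNs with values from a secondary DataFrame.

Below I share the data and the code used to implement my approach. Let me explain the scenario: merged_df is the original DataFrame, with several columns, some of which contain rows with NaN values: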

[image: merged_df]

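As the image above shows, the day_of_week and holiday_flg columns are of particular interest. I want to fill the NaN values in these columns by looking into a second DataFrame named date_info_df, which looks like this: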

[image: date_info_df]

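Using the values in the visit_date column of merged_df, it is possible to search the second DataFrame on calendar_date and find the corresponding match. This gives access to the day_of_week and holiday_flg values from the second DataFrame.

The end result of this exercise is a DataFrame that looks like this: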

[image: the expected final DataFrame]

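You will notice that my approach relies on apply() to run a custom function on every row of merged_df:

  • For each row, look for NaN values in day_of_week and holiday_flg;
  • When a NaN is found in either or both of these columns, use the date in that row's visit_date to find a matching entry in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
  • After a successful match, the value of date_info_df['day_of_week'] must be copied to merged_df['day_of_week'], and the value of date_info_df['holiday_flg'] must be copied to merged_df['holiday_flg'].

Here is working source code: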

import math
import pandas as pd
import numpy as np
from IPython.display import display

### Data for df
data = { 'air_store_id':     [              'air_a1',     'air_a2',     'air_a3',     'air_a4' ], 
         'area_name':        [               'Tokyo',       np.nan,       np.nan,       np.nan ], 
         'genre_name':       [            'Japanese',       np.nan,       np.nan,       np.nan ], 
         'hpg_store_id':     [              'hpg_h1',       np.nan,       np.nan,       np.nan ],          
         'latitude':         [                  1234,       np.nan,       np.nan,       np.nan ], 
         'longitude':        [                  5678,       np.nan,       np.nan,       np.nan ],         
         'reserve_datetime': [ '2017-04-22 11:00:00',       np.nan,       np.nan,       np.nan ], 
         'reserve_visitors': [                    25,           35,           45,       np.nan ], 
         'visit_datetime':   [ '2017-05-23 12:00:00',       np.nan,       np.nan,       np.nan ], 
         'visit_date':       [ '2017-05-23'         , '2017-05-24', '2017-05-25', '2017-05-27' ],
         'day_of_week':      [             'Tuesday',  'Wednesday',       np.nan,       np.nan ],
         'holiday_flg':      [                     0,       np.nan,       np.nan,       np.nan ]
       }

merged_df = pd.DataFrame(data)
display(merged_df)

### Data for date_info_df
data = { 'calendar_date':     [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ], 
         'day_of_week':       [    'Tuesday',  'Wednesday',   'Thursday',     'Friday',   'Saturday',     'Sunday' ], 
         'holiday_flg':       [            0,            0,            0,            0,            1,            1 ]         
       }

date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 
display(date_info_df)

# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']   
    holiday = row['holiday_flg']

    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']                               
        #print('  --> weekday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        weekday = date_info_df.at[idx,'day_of_week']
        #print('  --> weekday search_date=', search_date, 'is', weekday)        
        row['day_of_week'] = weekday        

    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']                               
        #print('  --> holiday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        holiday = date_info_df.at[idx,'holiday_flg']
        #print('  --> holiday search_date=', search_date, 'is', holiday)        
        row['holiday_flg'] = int(holiday)

    return row


# send every row to fix_weekday_and_holiday
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1) 

# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)

display(merged_df)

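I took some measurements so you can appreciate the struggle:

  • On a DataFrame with 6 rows, apply() takes 3.01 ms;
  • On a DataFrame with ~250000 rows, apply() takes 2 min 51 s;
  • On a DataFrame with ~1215000 rows, apply() takes 4 min 2 s.

How can I improve the performance of this task?

3 answers: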

Answer 0 (score: 4)

You can use an Index to speed up the lookup, and combine_first() to fill the NaNs:

cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))

print(merged_df[cols])

Result:

  day_of_week  holiday_flg
0     Tuesday          0.0
1   Wednesday          0.0
2    Thursday          0.0
3    Saturday          1.0
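
To see why combine_first() only touches the missing cells, here is a minimal sketch on toy data. The frames `left` and `right` are made up for illustration and are not the question's data. Values already present in the caller win; NaN cells are taken from the argument wherever the index and column labels line up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames sharing the same index labels (0 and 1).
left = pd.DataFrame({"day_of_week": ["Tuesday", np.nan],
                     "holiday_flg": [0.0, np.nan]})
right = pd.DataFrame({"day_of_week": ["WRONG", "Saturday"],
                      "holiday_flg": [9.0, 1.0]})

# Existing values in `left` are kept; only its NaN cells are filled from `right`.
filled = left.combine_first(right)
print(filled)
```

This is also why the answer above calls set_index(merged_df.index) on the looked-up rows: combine_first() aligns on labels, so both frames need the same index.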

Answer 1 (score: 1)

Here is one solution. It should be efficient because there is no explicit merge or apply.

merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date']) 
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 

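s = date_info_df.set_index('calendar_date')['day_of_week']
# Look up holiday_flg by date as well: a weekday does not determine whether a
# day is a holiday, and day_of_week is not unique across a full calendar.
t = date_info_df.set_index('calendar_date')['holiday_flg']

merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['visit_date'].map(t))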

Result

  air_store_id area_name day_of_week genre_name  holiday_flg hpg_store_id  \
0       air_a1     Tokyo     Tuesday   Japanese          0.0       hpg_h1   
1       air_a2       NaN   Wednesday        NaN          0.0          NaN   
2       air_a3       NaN    Thursday        NaN          0.0          NaN   
3       air_a4       NaN    Saturday        NaN          1.0          NaN   

   latitude  longitude     reserve_datetime  reserve_visitors visit_date  \
0    1234.0     5678.0  2017-04-22 11:00:00              25.0 2017-05-23   
1       NaN        NaN                  NaN              35.0 2017-05-24   
2       NaN        NaN                  NaN              45.0 2017-05-25   
3       NaN        NaN                  NaN               NaN 2017-05-27   

        visit_datetime  
0  2017-05-23 12:00:00  
1                  NaN  
2                  NaN  
3                  NaN  

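Explanation

  • s is a pd.Series, built from date_info_df, that maps calendar_date to day_of_week.
  • pd.Series.map is used with that series as its input to update the missing values wherever a match exists.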
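
The fillna-plus-map pattern can be sketched in isolation as follows. The frames `calendar` and `visits` are hypothetical stand-ins for date_info_df and merged_df, reduced to the columns that matter.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: one row per calendar date.
calendar = pd.DataFrame({
    "calendar_date": pd.to_datetime(["2017-05-25", "2017-05-27"]),
    "day_of_week": ["Thursday", "Saturday"],
})
# Hypothetical target frame with one known and one missing weekday.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2017-05-25", "2017-05-27"]),
    "day_of_week": ["Thursday", np.nan],
})

lookup = calendar.set_index("calendar_date")["day_of_week"]  # date -> weekday

# map() builds a full candidate column; fillna() uses it only where a value is NaN.
visits["day_of_week"] = visits["day_of_week"].fillna(visits["visit_date"].map(lookup))
print(visits)
```

Both steps are vectorized, so no Python-level function runs per row.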

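Answer 2 (score: 0)

Edit: you can also use merge to solve the problem. It is 10 times faster than the old method. (You need to make sure that "visit_date" and "calendar_date" have the same format.)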

# don't need to `set_index` for date_info_df but select columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], 
                left_on="visit_date", 
                right_on="calendar_date", 
                how="left") # outer should also work

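The desired results are now in the "day_of_week_y" and "holiday_flg_y" columns. In this approach and in the map approach, we do not use the old "day_of_week" and "holiday_flg" at all. We only need to map the results from date_info_df onto merged_df.

merge can also do the job because the entries in date_info_df are unique. (No duplicates will be created.)

You can also try pandas.Series.map. What it does is:

  Map values of Series using input correspondence (which can be a dict, Series, or function)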
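
The merge leaves the looked-up values in separate suffixed columns. One way to fold them back into the original columns is sketched below; the frames are hypothetical, and the "_cal" suffix is an arbitrary choice made here for readability instead of the default "_x"/"_y".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for merged_df and date_info_df.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2017-05-23", "2017-05-27"]),
    "day_of_week": ["Tuesday", np.nan],
    "holiday_flg": [0.0, np.nan],
})
calendar = pd.DataFrame({
    "calendar_date": pd.to_datetime(["2017-05-23", "2017-05-27"]),
    "day_of_week": ["Tuesday", "Saturday"],
    "holiday_flg": [0, 1],
})

# Left columns keep their names; calendar columns get the "_cal" suffix.
out = visits.merge(calendar, left_on="visit_date", right_on="calendar_date",
                   how="left", suffixes=("", "_cal"))

# Fill only the gaps, then drop the helper columns.
for col in ["day_of_week", "holiday_flg"]:
    out[col] = out[col].fillna(out[col + "_cal"])
out = out.drop(columns=["calendar_date", "day_of_week_cal", "holiday_flg_cal"])
print(out)
```

Using fillna here, instead of taking the calendar columns outright, keeps any values that were already present.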

# set "calendar_date" as the index such that 
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")

# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])

Note that merged_df.visit_date is originally of string type. We therefore use

merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

to turn it into a datetime.
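
A quick way to confirm that both key columns are of the same kind before mapping or merging is to inspect their dtypes. A minimal sketch, with hypothetical frames in place of the real data:

```python
import pandas as pd

# Hypothetical frames: visit_date starts as strings, calendar_date as datetimes.
visits = pd.DataFrame({"visit_date": ["2017-05-23", "2017-05-27"]})
calendar = pd.DataFrame({"calendar_date": pd.to_datetime(["2017-05-23", "2017-05-27"])})

print(visits["visit_date"].dtype, calendar["calendar_date"].dtype)

# Convert the string column so both keys are datetimes.
visits["visit_date"] = pd.to_datetime(visits["visit_date"])

same_kind = (pd.api.types.is_datetime64_any_dtype(visits["visit_date"])
             and pd.api.types.is_datetime64_any_dtype(calendar["calendar_date"]))
print(same_kind)
```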

Timings, using the date_info_df and merged_df datasets provided by karlphillip:

date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")   
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)

# merge method I propose at the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))    
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

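If the result is assigned back to merged_df, you can see that HYRY's method runs about 3 times faster. That is why HYRY's method looked faster than mine at first glance. I suspect this comes from the nature of combine_first: I guess the speed of HYRY's method depends on how sparse those columns of merged_df are. Assigning the result back fills the columns, so later runs of the timing loop are faster.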

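The merge and combine_first approaches perform almost the same. There may be cases where one is faster than the other, so each user should run some tests on their own dataset.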
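
For anyone who wants to run such a comparison outside IPython, a rough harness is sketched below. Everything in it is an assumption for illustration: the data is synthetic (a one-year calendar, 10,000 sampled visits with half of the weekdays blanked out), and the helper names `via_merge` and `via_map` are made up. Absolute numbers will differ on real data.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic calendar: one row per day of 2017.
calendar = pd.DataFrame({"calendar_date": pd.date_range("2017-01-01", periods=365)})
calendar["day_of_week"] = calendar["calendar_date"].dt.day_name()
lookup = calendar.set_index("calendar_date")["day_of_week"]

# Synthetic visits: dates sampled with replacement, every other weekday set to NaN.
n = 10_000
visits = pd.DataFrame({
    "visit_date": calendar["calendar_date"]
    .sample(n, replace=True, random_state=0)
    .reset_index(drop=True)
})
visits["day_of_week"] = visits["visit_date"].map(lookup)
visits.loc[::2, "day_of_week"] = np.nan

def via_merge():
    return visits.merge(calendar, left_on="visit_date", right_on="calendar_date",
                        how="left", suffixes=("", "_cal"))

def via_map():
    return visits["day_of_week"].fillna(visits["visit_date"].map(lookup))

for fn in (via_merge, via_map):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Neither function assigns its result back, so repeated runs see the same sparse input, which avoids the effect described above.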

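Another thing to note about the two approaches is that the merge approach assumes every date in merged_df is present in date_info_df. If some dates are in merged_df but not in date_info_df, NaN is returned for them, and that NaN can overwrite parts of merged_df that originally held values! This is when the combine_first approach should be preferred. See the discussion by MaxU in "Pandas replace, multi column criteria".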
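
The overwrite risk can be reproduced with a tiny hypothetical example in which one visit date is missing from the calendar but already carries a weekday:

```python
import numpy as np
import pandas as pd

# Hypothetical calendar that only knows about 2017-05-23.
calendar = pd.DataFrame({
    "calendar_date": pd.to_datetime(["2017-05-23"]),
    "day_of_week": ["Tuesday"],
})
# The second visit date is absent from the calendar but already has a weekday.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2017-05-23", "2017-05-24"]),
    "day_of_week": [np.nan, "Wednesday"],
})

lookup = calendar.set_index("calendar_date")["day_of_week"]
candidate = visits["visit_date"].map(lookup)  # NaN where the date is unknown

overwritten = candidate                               # plain replacement loses "Wednesday"
preserved = visits["day_of_week"].fillna(candidate)   # fills only the gaps
print(overwritten.tolist())
print(preserved.tolist())
```

Plain replacement turns the existing "Wednesday" into NaN, while the fill-only variant keeps it; combine_first behaves like the latter.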