How can I change my pandas DataFrame parsing logic to reduce CPU load?

Time: 2019-09-11 14:45:51

Tags: python pandas

I have a dataframe (new_dataframe) that looks like this:

|   time   |   other   |   another   |
--------------------------------------
|  dt_obj  |  .......  |  .........  |
|  dt_obj  |  .......  |  .........  |
|  dt_obj  |  .......  |  .........  |
|  dt_obj  |  .......  |  .........  |

And another dataframe (reference) that looks like this:

|   Time   |    Day    |  targ_field |
--------------------------------------
| time_obj |  weekday  |  .........  |
| time_obj | Saturday  |  .........  |
| time_obj |  Sunday   |  .........  |
| time_obj |  .......  |  .........  |

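I want to compare the values in reference["Time"] with new_dataframe["time"].time(), and add the matching reference["targ_field"] value to the corresponding row of new_dataframe. The catch is that, for each row, I choose one of 4 different reference dataframes based on that row's dt_obj (see def choose_ref(dt)).

Here is my code: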

import os

import pandas


def get_day(dt):
    """
    Return a string based on a given datetime.
    :param dt: datetime of interest
    :return: "Sunday", "Saturday" or "weekday"
    """
    if dt.weekday() == 6:
        result = "Sunday"
    elif dt.weekday() == 5:
        result = "Saturday"
    else:
        result = "weekday"
    return result

def choose_ref(dt):
    year = dt.year
    flag_1 = aware_datetime_obj_1
    flag_2 = aware_datetime_obj_2
    flag_3 = aware_datetime_obj_3
    flag_4 = aware_datetime_obj_4
    if flag_1 < dt < flag_2:
        return filename_1
    elif flag_2 < dt < flag_3:
        return filename_2
    elif flag_3 < dt < flag_4:
        return filename_3
    else:
        return filename_4

def add_ref(dt, reference_p):
    day = get_day(dt)
    reference_file = os.path.join(reference_p, choose_ref(dt))
    reference = pandas.read_csv(reference_file)
    reference["Time"] = pandas.to_datetime(reference["Time"], format="%H:%M:%S").dt.time  # convert strings in the "Time" column to time objects
    return reference.loc[(reference["Time"] == dt.time()) & (reference["Day"] == day)]


orig_dataframe = pandas.read_csv(orig_path)
orig_dataframe["time"] = pandas.to_datetime(orig_dataframe["time"]).dt.tz_localize("UTC")  # convert strings in the "time" column to aware datetime objects
new_dataframe = orig_dataframe.copy()
new_dataframe["targ_field"] = new_dataframe.apply(lambda df: add_ref(df["time"], reference_p), axis=1)

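The problem (at least as I understand it) is that, because I check the dt_obj of every row, the script keeps creating reference dataframes over and over, and my CPU load goes above 99% every time I run it.

How can I reduce this? Can you think of a smarter way to solve my problem, or a better way to code it?

Thanks

Edited to fix typos.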

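1 Answer:

Answer 0: (score: 0)

It looks like you are reading the file from disk for every single check. That is the first big no-no. Instead, if you want to stick with pure functions for everything, keep the 4 reference files as preloaded dataframes outside add_ref, and let choose_ref simply pick which one to use. It would look something like this:

Edit: I also noticed that you call to_datetime inside the function, which is run for every row and eats a lot of computing power as well. I have changed the solution to move that out too.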

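# Load and parse each reference file once, up front (paths joined as in the question)
names = (filename_1, filename_2, filename_3, filename_4)
file_1, file_2, file_3, file_4 = (pandas.read_csv(os.path.join(reference_p, n)) for n in names)
for ref in (file_1, file_2, file_3, file_4):
    ref["Time"] = pandas.to_datetime(ref["Time"], format="%H:%M:%S").dt.time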


def choose_ref(dt):
    year = dt.year
    flag_1 = aware_datetime_obj_1
    flag_2 = aware_datetime_obj_2
    flag_3 = aware_datetime_obj_3
    flag_4 = aware_datetime_obj_4
    if flag_1 < dt < flag_2:
        return file_1
    elif flag_2 < dt < flag_3:
        return file_2
    elif flag_3 < dt < flag_4:
        return file_3
    else:
        return file_4

Then:

def add_ref(dt, reference):
    day = get_day(dt)
    return reference.loc[(reference["Time"] == dt.time()) & (reference["Day"] == day)]
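
One caveat with add_ref as written: reference.loc[mask] returns a whole DataFrame (possibly empty), which is awkward to store in a single column through apply. Below is a minimal sketch of a scalar lookup, assuming exactly one targ_field value is wanted per row and that a missing match should become NaN. The helper name lookup_targ is hypothetical, the weekday label is passed in instead of calling get_day, and the sample table is made up.

```python
import datetime

import numpy as np
import pandas as pd


def lookup_targ(dt, day, reference):
    """Return the first matching targ_field value, or NaN when nothing matches."""
    mask = (reference["Time"] == dt.time()) & (reference["Day"] == day)
    matches = reference.loc[mask, "targ_field"]
    return matches.iloc[0] if not matches.empty else np.nan


# Tiny made-up reference table, only to exercise the helper.
reference = pd.DataFrame({
    "Time": [datetime.time(8, 0), datetime.time(8, 0)],
    "Day": ["weekday", "Sunday"],
    "targ_field": [10, 20],
})

some_morning = pd.Timestamp("2019-09-09 08:00:00", tz="UTC")
value = lookup_targ(some_morning, "weekday", reference)     # a matching row exists
missing = lookup_targ(some_morning, "Saturday", reference)  # no Saturday row in the table
```

Returning a scalar keeps the new column a plain column of values rather than a column of DataFrames.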

Finally, change the following line:

new_dataframe["targ_field"] = new_dataframe.apply(lambda df: add_ref(df["time"], choose_ref(df["time"])), axis=1)
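
Going one step further, the row-by-row apply could probably be replaced by a single merge. The following is a sketch under assumptions, not a tested drop-in for the asker's data: the four preloaded references are stacked into one table with a hypothetical ref_id column, the boundary timestamps in flags are made-up stand-ins for aware_datetime_obj_1..4, and make_ref builds made-up reference tables in place of the CSV files.

```python
import datetime

import numpy as np
import pandas as pd

# Made-up stand-ins for aware_datetime_obj_1..4.
flags = [pd.Timestamp(f"{y}-01-01", tz="UTC") for y in (2016, 2017, 2018, 2019)]


def make_ref(offset):
    # Made-up reference table; the real ones would come from read_csv.
    return pd.DataFrame({
        "Time": [datetime.time(8, 0)] * 3,
        "Day": ["weekday", "Saturday", "Sunday"],
        "targ_field": [offset + 1, offset + 2, offset + 3],
    })


# Stack the four references, tagging each with the index choose_ref would pick.
references = pd.concat(
    [make_ref(100 * i).assign(ref_id=i) for i in range(1, 5)],
    ignore_index=True,
)

df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2016-06-06 08:00:00", "2017-06-10 08:00:00", "2019-06-09 08:00:00"]
    ).tz_localize("UTC"),
})

t = df["time"]
# Same strict comparisons as choose_ref; everything else falls through to 4.
df["ref_id"] = np.select(
    [
        (t > flags[0]) & (t < flags[1]),
        (t > flags[1]) & (t < flags[2]),
        (t > flags[2]) & (t < flags[3]),
    ],
    [1, 2, 3],
    default=4,
)
# Vectorized get_day: dayofweek is 0 for Monday through 6 for Sunday.
dow = t.dt.dayofweek
df["Day"] = np.select([dow == 6, dow == 5], ["Sunday", "Saturday"], default="weekday")
df["Time"] = t.dt.time

# One left merge replaces the per-row lookups; unmatched rows get NaN.
result = df.merge(references, on=["ref_id", "Time", "Day"], how="left")
```

A left merge keeps every row of the original frame. If a reference table contained duplicate (Time, Day) pairs, the merge would duplicate rows, so that is worth checking on real data.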