I have a dataframe (new_dataframe) that looks like this:
| time | other | another |
--------------------------------------
| dt_obj | ....... | ......... |
| dt_obj | ....... | ......... |
| dt_obj | ....... | ......... |
| dt_obj | ....... | ......... |
And another dataframe (reference) that looks like this:
| Time | Day | targ_field |
--------------------------------------
| time_obj | weekday | ......... |
| time_obj | Saturday | ......... |
| time_obj | Sunday | ......... |
| time_obj | ....... | ......... |
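I want to compare the values in reference["Time"] with new_dataframe["time"].time(), and add the reference["targ_field"] value of the matching row to the relevant row of new_dataframe. The catch is that I choose between 4 different reference dataframes based on each row's dt_obj (see def choose_ref(dt)).
Here is my code: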
import os

import pandas


def get_day(dt):
    """
    Return a string based on a given datetime.
    :param dt: datetime of interest
    :return: "Sunday", "Saturday" or "weekday", depending on the day of dt
    """
    if dt.weekday() == 6:
        result = "Sunday"
    elif dt.weekday() == 5:
        result = "Saturday"
    else:
        result = "weekday"
    return result
def choose_ref(dt):
    year = dt.year
    flag_1 = aware_datetime_obj_1
    flag_2 = aware_datetime_obj_2
    flag_3 = aware_datetime_obj_3
    flag_4 = aware_datetime_obj_4
    if flag_1 < dt < flag_2:
        return filename_1
    elif flag_2 < dt < flag_3:
        return filename_2
    elif flag_3 < dt < flag_4:
        return filename_3
    else:
        return filename_4
def add_ref(dt, reference_p):
    day = get_day(dt)
    reference_file = os.path.join(reference_p, choose_ref(dt))
    reference = pandas.read_csv(reference_file)
    # convert strings in the "Time" column to time objects
    reference["Time"] = pandas.to_datetime(reference["Time"], format="%H:%M:%S").dt.time
    return reference.loc[(reference["Time"] == dt.time()) & (reference["Day"] == day)]
orig_dataframe = pandas.read_csv(orig_path)
# convert strings in the "time" column to aware datetime objects
orig_dataframe["time"] = pandas.to_datetime(orig_dataframe["time"]).dt.tz_localize("UTC")
new_dataframe = orig_dataframe.copy()
new_dataframe["targ_field"] = new_dataframe.apply(lambda df: add_ref(df["time"], reference_p), axis=1)
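The problem now (at least as I understand it) is that, because I check the dt_obj of every single row, the script keeps creating a new reference dataframe for each one, and my CPU load goes above 99% every time I run it.
How can I reduce this? Can you think of a smarter way to solve my problem, or a better way to code it?
Thanks
Edited to fix typos.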
Answer 0 (score: 0)
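It looks like you are reading the file from disk every time you check a row, and that is the first big no-no. Instead, if you want to keep everything as pure functions, preload the 4 reference files as dataframes outside add_ref, and let choose_ref simply pick which one to use. It would look something like this:
Edit: I also noticed that you call to_datetime inside the function, which eats a lot of compute as well. I have changed the solution so that this is moved out too.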
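# Load each reference file once, up front (join with reference_p first if the files live there).
file_1, file_2, file_3, file_4 = (
    pandas.read_csv(name) for name in (filename_1, filename_2, filename_3, filename_4)
)
for ref in (file_1, file_2, file_3, file_4):
    ref["Time"] = pandas.to_datetime(ref["Time"], format="%H:%M:%S").dt.time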
def choose_ref(dt):
    year = dt.year
    flag_1 = aware_datetime_obj_1
    flag_2 = aware_datetime_obj_2
    flag_3 = aware_datetime_obj_3
    flag_4 = aware_datetime_obj_4
    if flag_1 < dt < flag_2:
        return file_1
    elif flag_2 < dt < flag_3:
        return file_2
    elif flag_3 < dt < flag_4:
        return file_3
    else:
        return file_4
Then:
def add_ref(dt, reference):
    day = get_day(dt)
    return reference.loc[(reference["Time"] == dt.time()) & (reference["Day"] == day)]
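One thing to watch: indexing with reference.loc[mask] gives back a DataFrame (possibly empty), not a single value. If the goal is to store one targ_field value per row, something like the sketch below may be closer to what is wanted. The helper name lookup_targ and the sample data are made up for illustration, and it assumes you only want the first match and None when nothing matches.

```python
import datetime

import pandas


def get_day(dt):
    """Same classification as in the question: Sunday, Saturday or weekday."""
    if dt.weekday() == 6:
        return "Sunday"
    if dt.weekday() == 5:
        return "Saturday"
    return "weekday"


def lookup_targ(dt, reference):
    """Hypothetical helper: return one targ_field value, or None if no row matches."""
    mask = (reference["Time"] == dt.time()) & (reference["Day"] == get_day(dt))
    matches = reference.loc[mask, "targ_field"]
    return matches.iloc[0] if not matches.empty else None


# Made-up reference table, only to show the shape of the data.
reference = pandas.DataFrame({
    "Time": [datetime.time(8, 0), datetime.time(8, 0)],
    "Day": ["weekday", "Saturday"],
    "targ_field": [10, 20],
})

saturday_value = lookup_targ(pandas.Timestamp("2024-01-06 08:00:00", tz="UTC"), reference)
monday_value = lookup_targ(pandas.Timestamp("2024-01-08 08:00:00", tz="UTC"), reference)
missing_value = lookup_targ(pandas.Timestamp("2024-01-08 09:00:00", tz="UTC"), reference)
```

Here the Saturday timestamp picks up the "Saturday" row, the Monday one picks up the "weekday" row, and the 09:00 timestamp has no matching row.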
最后更改以下行:
new_dataframe["targ_field"] = new_dataframe.apply(lambda df: add_ref(df["time"], choose_ref(df["time"])), axis=1)
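If this is still slow, the row-wise apply itself is probably the remaining cost. A possible alternative, not what the code above does, is to build the lookup keys as whole columns and do a single merge. The following is only a sketch under assumptions: the two small reference tables, the ref_id column and the single boundary flag are invented stand-ins for your four files and four flags, and it assumes each (Time, Day) pair appears at most once per reference table.

```python
import datetime

import pandas


def make_reference(offset):
    """Build a made-up reference table; real data would come from the CSV files."""
    return pandas.DataFrame({
        "Time": [datetime.time(8, 0)] * 3,
        "Day": ["weekday", "Saturday", "Sunday"],
        "targ_field": [offset + 1, offset + 2, offset + 3],
    })


# Two hypothetical reference tables (the question has four), tagged with an id.
references = {1: make_reference(100), 2: make_reference(200)}
all_refs = pandas.concat(
    [ref.assign(ref_id=ref_id) for ref_id, ref in references.items()],
    ignore_index=True,
)

# Hypothetical boundary playing the role of the flag_* datetimes.
flag = pandas.Timestamp("2024-01-07", tz="UTC")

new_dataframe = pandas.DataFrame({
    "time": pandas.to_datetime([
        "2024-01-05 08:00:00",  # Friday, before the flag
        "2024-01-06 08:00:00",  # Saturday, before the flag
        "2024-01-08 08:00:00",  # Monday, after the flag
        "2024-01-08 09:00:00",  # Monday, no 09:00 row in the reference
    ]).tz_localize("UTC"),
})

# Compute all lookup keys column-wise instead of once per row.
times = new_dataframe["time"]
keys = pandas.DataFrame({
    "Time": times.dt.time,
    "Day": times.dt.dayofweek.map({5: "Saturday", 6: "Sunday"}).fillna("weekday"),
    "ref_id": (times >= flag).map({False: 1, True: 2}),
})

# A left merge keeps one output row per input row, in the original order;
# validate raises if a key is duplicated on the reference side.
merged = keys.merge(all_refs, on=["Time", "Day", "ref_id"], how="left", validate="many_to_one")
new_dataframe["targ_field"] = merged["targ_field"].to_numpy()
```

Rows without a match end up as NaN in targ_field. With four real flags you would replace the single comparison with whatever logic choose_ref uses, for example pandas.cut on the time column.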