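I currently have a working algorithm that updates the rows of my database (BaseA) with the values from a new export (BaseB), matching rows on their ID and DateB. The problem is that my algorithm is very inefficient. This is only an example: the real code must work with any number of columns, whatever their names (the only columns found everywhere are ID, DateB and Nb Treated, although the columns could also be listed explicitly).
How can I make the computation faster? (It currently takes nearly an hour on the real data.)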
BaseA:
ID  DateA       DateB       DateC       Nb Treated
A   11/07/2017  11/07/2017  11/07/2017  1
B   12/07/2017  10/05/2017  12/07/2017  1
B   12/07/2017  12/07/2017  12/07/2017  2
C   13/07/2017  13/07/2017  13/07/2017  1
D   14/07/2017  14/07/2017  14/07/2017  1
E   15/07/2017              15/07/2017  0
F   16/07/2017  16/07/2017  16/07/2017  1
G   17/07/2017  17/07/2017  17/07/2017  2
J   18/07/2017              18/07/2017  0
G   17/07/2017  15/09/2016  17/07/2017  1
BaseB:
ID  DateA       DateB       DateC
A   11/07/2017  11/07/2017  11/07/2017
B   13/06/2017  13/06/2017  13/06/2017
C   14/06/2017  14/06/2017  14/06/2017
E   15/07/2017  15/07/2017  15/07/2017
F   16/07/2017  16/07/2017  16/07/2017
H   11/06/2017              11/06/2017
I   12/06/2017  12/06/2017  12/06/2017
Expected result:
ID  DateA       DateB       DateC       Nb Treated
A   11/07/2017  11/07/2017  11/07/2017  1
B   13/06/2017  12/07/2017  13/06/2017  2
B   13/06/2017  13/06/2017  13/06/2017  3
B   13/06/2017  10/05/2017  13/06/2017  1
C   14/06/2017  14/06/2017  14/06/2017  2
C   14/06/2017  13/07/2017  14/06/2017  1
D   14/07/2017  14/07/2017  14/07/2017  1
E   15/07/2017  15/07/2017  15/07/2017  1
F   16/07/2017  16/07/2017  16/07/2017  1
G   17/07/2017  17/07/2017  17/07/2017  2
G   17/07/2017  15/09/2016  17/07/2017  1
H   11/06/2017              11/06/2017  0
I   12/06/2017  12/06/2017  12/06/2017  1
J   18/07/2017              18/07/2017  0
import pandas as pd
import numpy as np

database = pd.read_excel("baseA.xlsx")
dataset = pd.read_excel("baseB.xlsx")

# INSERT THE ALGORITHM HERE

# Stack both bases; on a duplicate (ID, DateB) pair the BaseA row is kept
datater = pd.concat([database, dataset])
datater.drop_duplicates(["ID", "DateB"], inplace=True)
# Rows with neither a count nor a DateB have not been treated yet
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]) & pd.isnull(datater["DateB"]), 0, datater["Nb Treated"])
# Highest existing count per ID
datatri = datater.groupby(["ID"], sort=False)["Nb Treated"].max()
dicoREFNbDevis = datatri.to_dict()
# Mark new rows with their ID, then swap that ID for (max count + 1)
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]), datater["ID"], datater["Nb Treated"])
dicoREFNbDevis = {k: v+1 for k, v in dicoREFNbDevis.items()}
datater["Nb Treated"] = datater["Nb Treated"].replace(dicoREFNbDevis)
# IDs with no earlier count start at 1
datater["Nb Treated"] = datater["Nb Treated"].fillna(1)
datater = datater.sort_values(["ID"])
datater = datater[["ID", "DateA", "DateB", "DateC", "Nb Treated"]]
writer = pd.ExcelWriter('NewBase.xlsx', engine='xlsxwriter')
datater.to_excel(writer, sheet_name='Base', index=False)
writer.close()
The algorithm:
# Drop BaseA rows whose ID is in BaseB and whose DateB is empty
database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
for i in dataset:
    if i != "ID" and i != "DateB" and i != "Nb Treated":
        # ID -> value of column i in BaseB
        dicoDate = dataset.set_index("ID")[i].to_dict()
        # Put the ID in place of the value on rows to update, then swap it for the BaseB value
        database[i] = np.where(database["ID"].isin(dataset["ID"].unique()), database["ID"], database[i])
        database[i] = database[i].replace(dicoDate)
        database[i] = database[i].apply(lambda x: pd.to_datetime(x))
Inspired by "Modifying a subset of rows in a pandas dataframe", I tried this:
database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
database.loc[database["ID"].isin(dataset["ID"].unique()), ['DateA','DateC']] = dataset.loc[dataset["ID"].isin(database["ID"].unique()), ['DateA','DateC']]
which gives me this output:
ID  DateA       DateB       DateC       Nb Treated
A   11/07/2017  11/07/2017  11/07/2017  1
B   14/06/2017  12/07/2017  14/06/2017  2
B   13/06/2017  13/06/2017  13/06/2017  3
B   13/06/2017  10/05/2017  13/06/2017  1
C               13/07/2017              1
C   14/06/2017  14/06/2017  14/06/2017  2
D   14/07/2017  14/07/2017  14/07/2017  1
E   15/07/2017  15/07/2017  15/07/2017  1
F               16/07/2017              1
G   17/07/2017  15/09/2016  17/07/2017  1
G   17/07/2017  17/07/2017  17/07/2017  2
H   11/06/2017              11/06/2017  0
I   12/06/2017  12/06/2017  12/06/2017  1
J   18/07/2017              18/07/2017  0
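A likely reason for the blank and shifted dates in that output: when a Series or DataFrame is assigned through a label-based indexer, pandas aligns the right-hand side on the row index labels, not on the ID column. The toy frames below (`left`, `right`, `by_label` and `by_key` are made-up names, and the values are placeholders) sketch the difference between that assignment and a lookup keyed on ID.

```python
import pandas as pd

# Two small frames whose row labels do NOT line up with their IDs.
left = pd.DataFrame({"ID": ["B", "C", "F"], "DateA": ["old", "old", "old"]})
right = pd.DataFrame({"ID": ["A", "B", "C"], "DateA": ["a", "b", "c"]})

# Label-based assignment: the right-hand side is aligned on the row index,
# so row 0 (ID B) finds no matching label and row 1 (ID C) receives B's value.
by_label = left.copy()
by_label.loc[left["ID"].isin(right["ID"]), "DateA"] = right.loc[
    right["ID"].isin(left["ID"]), "DateA"
]

# Key-based lookup: map each ID to the value stored for that ID in `right`.
lookup = right.set_index("ID")["DateA"]
mask = left["ID"].isin(lookup.index)
by_key = left.copy()
by_key.loc[mask, "DateA"] = left.loc[mask, "ID"].map(lookup)

print(by_label)
print(by_key)
```

With the key-based version, rows B and C receive their own values and row F, which is absent from `right`, is left untouched.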
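The purpose of the algorithm is to update BaseA with the values from the new export BaseB. The values in BaseB are the latest version of the cases in the database. If the same ID appears in both BaseA and BaseB, there are different cases:
- Usually you remove the row from BaseA and replace it with the row from BaseB.
- But if the DateB of the row in BaseA differs from the DateB of the row in BaseB, both rows should be in BaseA. Either way, the other columns must always be updated.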
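The rules above can be sketched without any per-cell dictionary replacement, by looking values up by ID with `Series.map` and computing the counts with `groupby(...).transform`. This is only one possible approach and has not been timed on the real data; `update_base`, its parameter names, `base_a` and `base_b` are hypothetical names chosen for the illustration, and the sample frames mirror the BaseA and BaseB tables under the assumption that their blank cells are missing DateB values.

```python
import pandas as pd


def update_base(database, dataset, key="ID", date_col="DateB", count_col="Nb Treated"):
    """Merge the export `dataset` into `database` using vectorized lookups."""
    # 1. Drop database rows whose ID is in the export and whose DateB is empty.
    in_export = database[key].isin(dataset[key])
    database = database[~(in_export & database[date_col].isna())].copy()

    # 2. Refresh every other column from the export, looked up by ID.
    #    If an ID occurs several times in the export, its last row is used.
    lookup = dataset.drop_duplicates(key, keep="last").set_index(key)
    in_export = database[key].isin(lookup.index)
    other_cols = [c for c in dataset.columns if c not in (key, date_col, count_col)]
    for col in other_cols:
        database.loc[in_export, col] = database.loc[in_export, key].map(lookup[col])

    # 3. Append the export; on a duplicate (ID, DateB) pair the database row wins.
    combined = pd.concat([database, dataset], ignore_index=True)
    combined = combined.drop_duplicates([key, date_col])

    # 4. Fill the count of the new rows: 0 without a DateB, otherwise the
    #    highest existing count for that ID plus one, or 1 if there is none.
    missing = combined[count_col].isna()
    combined.loc[missing & combined[date_col].isna(), count_col] = 0
    group_max = combined.groupby(key)[count_col].transform("max")
    missing = combined[count_col].isna()
    combined.loc[missing, count_col] = group_max[missing] + 1
    combined[count_col] = combined[count_col].fillna(1)

    return combined.sort_values(key, kind="stable").reset_index(drop=True)


cols = ["ID", "DateA", "DateB", "DateC", "Nb Treated"]
base_a = pd.DataFrame([
    ("A", "11/07/2017", "11/07/2017", "11/07/2017", 1),
    ("B", "12/07/2017", "10/05/2017", "12/07/2017", 1),
    ("B", "12/07/2017", "12/07/2017", "12/07/2017", 2),
    ("C", "13/07/2017", "13/07/2017", "13/07/2017", 1),
    ("D", "14/07/2017", "14/07/2017", "14/07/2017", 1),
    ("E", "15/07/2017", None, "15/07/2017", 0),
    ("F", "16/07/2017", "16/07/2017", "16/07/2017", 1),
    ("G", "17/07/2017", "17/07/2017", "17/07/2017", 2),
    ("J", "18/07/2017", None, "18/07/2017", 0),
    ("G", "17/07/2017", "15/09/2016", "17/07/2017", 1),
], columns=cols)
base_b = pd.DataFrame([
    ("A", "11/07/2017", "11/07/2017", "11/07/2017"),
    ("B", "13/06/2017", "13/06/2017", "13/06/2017"),
    ("C", "14/06/2017", "14/06/2017", "14/06/2017"),
    ("E", "15/07/2017", "15/07/2017", "15/07/2017"),
    ("F", "16/07/2017", "16/07/2017", "16/07/2017"),
    ("H", "11/06/2017", None, "11/06/2017"),
    ("I", "12/06/2017", "12/06/2017", "12/06/2017"),
], columns=cols[:4])

result = update_base(base_a, base_b)
print(result.to_string(index=False))
```

The loop runs once per column rather than once per cell, so its cost should grow with the number of columns and not with the size of a replacement dictionary. Dates are kept as strings here to keep the sketch short; with real data they could be parsed once up front with `pd.to_datetime`.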