When adding new data

Time: 2017-08-01 12:55:22

Tags: python python-3.x pandas

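I currently have a working algorithm that updates the rows of my database (BaseA) with the values from a new export (BaseB), based on their ID and their DateB value. The problem is that my algorithm is very inefficient. This is only an example: the real code has to work with any number of columns, whatever their names are (the only columns guaranteed to be present are ID, DateB and Nb Treated). The columns can also be listed explicitly if needed.

How can I make the computation faster? (It currently takes nearly an hour on the real data.)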

BaseA:

  ID      DateA         DateB          DateC    Nb Treated
  A     11/07/2017   11/07/2017     11/07/2017      1
  B     12/07/2017   10/05/2017     12/07/2017      1
  B     12/07/2017   12/07/2017     12/07/2017      2
  C     13/07/2017   13/07/2017     13/07/2017      1
  D     14/07/2017   14/07/2017     14/07/2017      1
  E     15/07/2017                  15/07/2017      0
  F     16/07/2017   16/07/2017     16/07/2017      1
  G     17/07/2017   17/07/2017     17/07/2017      2
  J     18/07/2017                  18/07/2017      0
  G     17/07/2017   15/09/2016     17/07/2017      1

BaseB:

  ID       DateA           DateB           DateC
  A     11/07/2017      11/07/2017      11/07/2017
  B     13/06/2017      13/06/2017      13/06/2017
  C     14/06/2017      14/06/2017      14/06/2017
  E     15/07/2017      15/07/2017      15/07/2017
  F     16/07/2017      16/07/2017      16/07/2017
  H     11/06/2017                      11/06/2017
  I     12/06/2017      12/06/2017      12/06/2017

What I want to get:

  ID      DateA           DateB           DateC     Nb Treated
  A     11/07/2017      11/07/2017      11/07/2017      1
  B     13/06/2017      12/07/2017      13/06/2017      2
  B     13/06/2017      13/06/2017      13/06/2017      3
  B     13/06/2017      10/05/2017      13/06/2017      1
  C     14/06/2017      14/06/2017      14/06/2017      2
  C     14/06/2017      13/07/2017      14/06/2017      1
  D     14/07/2017      14/07/2017      14/07/2017      1
  E     15/07/2017      15/07/2017      15/07/2017      1
  F     16/07/2017      16/07/2017      16/07/2017      1
  G     17/07/2017      17/07/2017      17/07/2017      2
  G     17/07/2017      15/09/2016      17/07/2017      1
  H     11/06/2017                      11/06/2017      0
  I     12/06/2017      12/06/2017      12/06/2017      1
  J     18/07/2017                      18/07/2017      0

My general code, without the algorithm:

import pandas as pd
import numpy as np

database = pd.read_excel("baseA.xlsx")
dataset = pd.read_excel("baseB.xlsx")

 # INSERT THE ALGORITHM HERE

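# Stack the updated database on top of the new export; on duplicate (ID, DateB) pairs the database row is kept
datater = pd.concat([database, dataset])
datater.drop_duplicates(["ID", "DateB"], inplace=True)
# Rows with neither a count nor a DateB have not been treated yet
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]) & pd.isnull(datater["DateB"]), 0, datater["Nb Treated"])
# Highest count known so far for each ID
datatri = datater.groupby(["ID"], sort=False)["Nb Treated"].max()
dicoREFNbDevis = datatri.to_dict()
# Rows still missing a count temporarily hold their ID, which is then mapped to max + 1
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]), datater["ID"], datater["Nb Treated"])
dicoREFNbDevis = {k: v+1 for k, v in dicoREFNbDevis.items()}
datater["Nb Treated"] = datater["Nb Treated"].replace(dicoREFNbDevis)
# IDs with no previous count start at 1
datater["Nb Treated"] = datater["Nb Treated"].fillna(1)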

datater = datater.sort_values(["ID"])

datater=datater[["ID","DateA","DateB","DateC","Nb Treated"]]
# The context manager saves and closes the workbook on exit
with pd.ExcelWriter('NewBase.xlsx', engine='xlsxwriter') as writer:
    datater.to_excel(writer, sheet_name='Base', index=False)

The algorithm I currently use, which works:

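# Drop database rows whose ID reappears in the export and whose DateB is empty
database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
for i in dataset:
    # Every column other than the key columns is refreshed from the export
    if i != "ID" and i != "DateB" and i != "Nb Treated":
        dicoDate = dataset.set_index("ID")[i].to_dict()  # ID -> new value
        # Put the ID into the cells to refresh, then map each ID to its new value
        database[i] = np.where(database["ID"].isin(dataset["ID"].unique()), database["ID"], database[i])
        database[i].replace(dicoDate, inplace=True)
        database[i] = database[i].apply(lambda x: pd.to_datetime(x))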
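
One possible vectorized variant of the column refresh in the loop above, as a sketch only. It assumes each ID appears at most once in the export (as in BaseB) and looks the new value up with `Series.map`, falling back to the old value for IDs that are not in the export, instead of writing IDs into the cells and calling `replace`. The names `refresh_columns` and `key_cols` are hypothetical, and the date conversion step is left out.

```python
import pandas as pd

def refresh_columns(database, dataset, key_cols=("ID", "DateB", "Nb Treated")):
    """Overwrite the non-key columns of `database` with the export's values, matched on ID."""
    out = database.copy()
    lookup = dataset.set_index("ID")           # assumption: one row per ID in the export
    in_export = out["ID"].isin(lookup.index)   # rows whose ID is present in the export
    for col in dataset.columns:
        if col in key_cols:
            continue
        new_values = out["ID"].map(lookup[col])  # value from the export, NaN when the ID is unknown
        # Take the export's value when the ID is in the export, else keep the old one
        out[col] = new_values.where(in_export, out[col])
    return out
```

The loop still runs once per column, but each pass is a single index lookup rather than a dictionary-based `replace` over the whole column, which may help on large data. This has not been timed on the real dataset.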

An attempt that almost works:

Inspired by "Modifying a subset of rows in a pandas dataframe":

database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
database.loc[database["ID"].isin(dataset["ID"].unique()), ['DateA','DateC']] = dataset.loc[dataset["ID"].isin(database["ID"].unique()), ['DateA','DateC']]

Which gives me this output:

  ID       DateA          DateB            DateC     Nb Treated
  A     11/07/2017      11/07/2017      11/07/2017      1
  B     14/06/2017      12/07/2017      14/06/2017      2
  B     13/06/2017      13/06/2017      13/06/2017      3
  B     13/06/2017      10/05/2017      13/06/2017      1
  C                     13/07/2017                      1
  C     14/06/2017      14/06/2017      14/06/2017      2
  D     14/07/2017      14/07/2017      14/07/2017      1
  E     15/07/2017      15/07/2017      15/07/2017      1
  F                     16/07/2017                      1
  G     17/07/2017      15/09/2016      17/07/2017      1
  G     17/07/2017      17/07/2017      17/07/2017      2
  H     11/06/2017                      11/06/2017      0
  I     12/06/2017      12/06/2017      12/06/2017      1
  J     18/07/2017                      18/07/2017      0
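
The empty DateA and DateC cells for C and F, and the 14/06/2017 that ended up on a B row, look consistent with how pandas assigns one DataFrame into another: the right-hand side is aligned on index labels, not on ID and not on position. The following is a small sketch with made-up values (`left`, `right` and the "old"/"new" strings are illustrative only) that reproduces the same kind of shift.

```python
import pandas as pd

left = pd.DataFrame({"ID": ["A", "B", "B", "C"], "DateA": ["old"] * 4})
right = pd.DataFrame({"ID": ["A", "B", "C"], "DateA": ["new A", "new B", "new C"]})

mask = left["ID"].isin(right["ID"])
# The right-hand DataFrame carries index labels 0, 1, 2; pandas matches them
# against the index labels of the selected rows of `left` (0, 1, 2, 3).
left.loc[mask, ["DateA"]] = right.loc[right["ID"].isin(left["ID"]), ["DateA"]]

# Row 2 (ID B) receives the value of C, and row 3 (ID C) has no matching
# label on the right-hand side, so it ends up missing.
print(left)
```

Matching on ID rather than on the row index avoids this, for example by looking values up through an ID-indexed Series.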

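Edit:

The purpose of the algorithm is to update BaseA with the values from the new export, BaseB. The values in BaseB are the latest version of the cases in the database. If the same ID is present in both BaseA and BaseB, there are different cases:
  - Usually, you delete the row from BaseA and take the row from BaseB instead.
  - But if the DateB of the row in BaseA differs from that of the row in BaseB, both rows should be in BaseA. In every case, the other columns must always be updated.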
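
The rules above can be sketched without a Python-level lookup per cell. This is one reading of the rules, not a tested solution for the real data: it assumes each ID appears only once in BaseB, treats an empty DateB in BaseA as "replace the row", and leaves the count column of newly added rows empty so that the general code can fill it afterwards. The function name `apply_export` and its parameters are hypothetical.

```python
import pandas as pd

def apply_export(base_a, base_b, key="ID", date_col="DateB"):
    """Sketch of the update rules; assumes `key` is unique within base_b."""
    in_b = base_a[key].isin(base_b[key])
    # Case 1: a BaseA row with no DateB is superseded by the BaseB row for that ID
    kept = base_a[~(in_b & base_a[date_col].isna())].copy()
    # The other shared columns always take BaseB's values when the ID is in BaseB
    shared = [c for c in base_b.columns if c not in (key, date_col) and c in kept.columns]
    fresh = kept[[key]].merge(base_b[[key] + shared], on=key, how="left")
    fresh.index = kept.index  # a left merge on a unique key keeps row order and row count
    hit = kept[key].isin(base_b[key])
    for col in shared:
        kept[col] = fresh[col].where(hit, kept[col])
    # Case 2: same ID with a different DateB keeps both rows;
    # identical (ID, DateB) pairs collapse to the row coming from BaseA
    combined = pd.concat([kept, base_b], ignore_index=True)
    combined = combined.drop_duplicates([key, date_col])
    # mergesort is stable, so rows with the same ID keep their relative order
    return combined.sort_values(key, kind="mergesort").reset_index(drop=True)
```

On the BaseA and BaseB samples this reading should yield the same set of (ID, DateA, DateB, DateC) rows as the desired table, although the order of rows within an ID may differ.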

0 answers:

No answers