Merging (and conditionally updating) Pandas DataFrames

Asked: 2019-06-27 09:39:17

Tags: python pandas dataframe merge

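I am scraping some data (automating an hour-long task) and want to append the newly collected data to the DataFrame I started with, so that this DataFrame (below: df_total) keeps growing.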

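Every scrape gets a timestamp, but I do not want to overwrite the timestamp when the same data point (identifiable by a unique ID) is scraped again.
In addition, entries can be marked as active or inactive. If a previously active entry turns inactive in a later scrape, this should be made explicit in the data by adding a new timestamp (showing since when the offer has been inactive).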

These are the data I want to merge and update (df_total and df_new):

# df_total:
    ID Names        Timestamp  Active Inactive-Stamp
0  121     A  20190626-120000    True               
1  122     B  20190626-120000    True               
2  123     C  20190626-120000    True               

# df_new:
    ID Names        Timestamp  Active Inactive-Stamp
0  122     B  20190627-140000    True               
1  123     C  20190627-140000   False               
2  124     D  20190627-140000    True               
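
For anyone who wants to reproduce this, the two frames above can be rebuilt roughly as follows. This is only a sketch: it assumes the timestamps are stored as plain strings and that the blank Inactive-Stamp cells are empty strings, neither of which the printed output can confirm.

```python
import pandas as pd

columns = ["ID", "Names", "Timestamp", "Active", "Inactive-Stamp"]

# Assumption: timestamps are strings, empty Inactive-Stamp cells are "".
df_total = pd.DataFrame(
    [
        [121, "A", "20190626-120000", True, ""],
        [122, "B", "20190626-120000", True, ""],
        [123, "C", "20190626-120000", True, ""],
    ],
    columns=columns,
)

df_new = pd.DataFrame(
    [
        [122, "B", "20190627-140000", True, ""],
        [123, "C", "20190627-140000", False, ""],
        [124, "D", "20190627-140000", True, ""],
    ],
    columns=columns,
)

print(df_total)
print(df_new)
```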

What works:
The merging and updating itself works with the following method:

def join_data(df_total, df_new):
    # Doing a full outer join
    df = df_total.merge(df_new, on=list(df_total), how='outer')

    # Drop real duplicates (e.g. ID:122, Name:B)
    df = df.drop_duplicates(subset=['ID', 'Active'], keep='first')

    # "Add" the timestamp for the now inactive entry
    df.loc[(df.duplicated(subset=['ID'],
                          keep='last') == True), 'Inactive-Stamp'] = "newTimeStamp"

    # Change `active` (from `True` to `False`)
    df.loc[(df.duplicated(subset=['ID'],
                          keep='last') == True), 'Active'] = False

    # Delete the new duplicate
    df = df[(df.duplicated(subset=['ID'],
                           keep='first') == False)]

    # Reset index of new Dataframe 
    df.reset_index(inplace=True, drop=True)
    return(df)

df_total = join_data(df_total, df_new)

This results in the DataFrames being merged and the entries being updated as planned:

    ID Names        Timestamp  Active Inactive-Stamp
0  121     A  20190626-120000    True               
1  122     B  20190626-120000    True               
2  123     C  20190626-120000   False   newTimeStamp
3  124     D  20190627-140000    True               

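What goes wrong
Entries can change back to active = True.
The problem is that some users deactivate their entry only to activate it again (so that it shows up at the top of the website again).
So the method above also has to work when inactive data becomes active again. Consider the following (even newer) data: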

# df_newer:
    ID Names        Timestamp  Active Inactive-Stamp
0  123     C  20190628-160000    True               
1  125     E  20190628-160000    True               

What I have tried
I tried to modify the method by adding the following (for the timestamp and for active):

# Changing the Inactive-Stamp:
df.loc[(df.duplicated(subset=['ID'], keep='last') == True)
        &
        (df['Active'][df.duplicated(subset=['ID'], keep='last')==True] == True),
        'Inactive-Stamp'] = "Timestamp"

df.loc[(df.duplicated(subset=['ID'], keep='last') == True)
       &
       (df['Active'][df.duplicated(subset=['ID'], keep='last') == True] == False),
       'Inactive-Stamp'] = ""


# Changing the Active-Boolean
df.loc[(df.duplicated(subset=['ID'], keep='last')==True)
       &
       (df['Active'][df.duplicated(subset=['ID'], keep='last')==True]==True),
       'Active'] = False
df.loc[(df.duplicated(subset=['ID'], keep='last')==True)
       &
       (df['Active'][df.duplicated(subset=['ID'], keep='last')==True]==False),
       'Active'] = True

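I expected the Inactive-Stamp to be created only when an offer changes to inactive, and to be removed/replaced only in the opposite case.
Apparently, this instead immediately changes every entry that went from active to inactive back to active.
That is why I thought of executing these statements inside an if-condition, which would look like this (pseudocode):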

if (duplicate ID with different active & new entry is inactive):
    df['Active'] = False
    df['Inactive-Stamp'] = newTimeStamp
elif (duplicate ID with different active & new entry is active):
    df['Active'] = True
    df['Inactive-Stamp'] = ''

However, I do not know how to implement that if-condition on a pandas DataFrame, how to combine it with df.loc, or whether this would even be good practice.
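
One possible way to express the if/elif above without a Python-level `if` is to compute both conditions as boolean masks first and only then assign, so that the second assignment cannot undo the first. The following is a minimal sketch, not a tested production solution. It assumes that `ID` is unique within each frame, that `Active` holds real booleans, that the timestamp of the new scrape should be used as the Inactive-Stamp, and that an empty string means "no stamp". The function name `update_total` is made up for illustration.

```python
import pandas as pd


def update_total(df_total, df_new):
    """Merge a new scrape into df_total; change rows only when Active flips.

    Assumes 'ID' is unique within each frame and 'Active' holds booleans.
    """
    total = df_total.set_index("ID")
    new = df_new.set_index("ID")

    # IDs present in both frames, with their old and new Active flags.
    common = total.index.intersection(new.index)
    was_active = total.loc[common, "Active"].astype(bool)
    is_active = new.loc[common, "Active"].astype(bool)

    # Compute both masks before assigning anything.
    deactivated = common[(was_active & ~is_active).to_numpy()]
    reactivated = common[(~was_active & is_active).to_numpy()]

    # "if" branch: active -> inactive, stamp with the new scrape's timestamp.
    total.loc[deactivated, "Active"] = False
    total.loc[deactivated, "Inactive-Stamp"] = new.loc[deactivated, "Timestamp"]

    # "elif" branch: inactive -> active, clear the stamp again.
    total.loc[reactivated, "Active"] = True
    total.loc[reactivated, "Inactive-Stamp"] = ""

    # IDs seen for the first time are appended unchanged.
    unseen = new.loc[new.index.difference(total.index)]
    return pd.concat([total, unseen]).reset_index()


# Sample data from the question (blank stamps assumed to be empty strings).
cols = ["ID", "Names", "Timestamp", "Active", "Inactive-Stamp"]
df_total = pd.DataFrame(
    [[121, "A", "20190626-120000", True, ""],
     [122, "B", "20190626-120000", True, ""],
     [123, "C", "20190626-120000", True, ""]],
    columns=cols,
)
df_new = pd.DataFrame(
    [[122, "B", "20190627-140000", True, ""],
     [123, "C", "20190627-140000", False, ""],
     [124, "D", "20190627-140000", True, ""]],
    columns=cols,
)
df_newer = pd.DataFrame(
    [[123, "C", "20190628-160000", True, ""],
     [125, "E", "20190628-160000", True, ""]],
    columns=cols,
)

after_first = update_total(df_total, df_new)
after_second = update_total(after_first, df_newer)
print(after_first)
print(after_second)
```

With the sample data, ID 123 is marked inactive after the first call and is active again with an empty stamp after the second, while its original Timestamp is left untouched. Whether clearing the stamp (instead of keeping a history of it) is the right choice depends on what the data is used for.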

0 answers:

No answers