Question

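I am scraping some data (an automated hourly task) and want to append the newly collected data to the DataFrame I started with, so that this DataFrame (hereafter: df_total) keeps growing.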

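Every scraped record gets a timestamp, but I don't want to overwrite that timestamp when the same data point (identified by a unique ID) is scraped again.
In addition, an entry can be marked as active or inactive. If a previously active entry turns out to be inactive in a later scrape, this should be made explicit in the data by adding a new timestamp that shows since when the offer has been inactive.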

These are the data I want to merge and update (df_total, df_new):

# df_total:
    ID Names        Timestamp  Active Inactive-Stamp
0  121     A  20190626-120000    True               
1  122     B  20190626-120000    True               
2  123     C  20190626-120000    True               

# df_new:
    ID Names        Timestamp  Active Inactive-Stamp
0  122     B  20190627-140000    True               
1  123     C  20190627-140000   False               
2  124     D  20190627-140000    True
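
For reproducibility, the two frames above could be built roughly like this. It is only a sketch: it assumes the blank Inactive-Stamp cells are empty strings and that the timestamps are stored as plain strings, neither of which is stated explicitly.

```python
import pandas as pd

columns = ["ID", "Names", "Timestamp", "Active", "Inactive-Stamp"]

# Assumption: a blank Inactive-Stamp is represented by an empty string.
df_total = pd.DataFrame(
    [
        [121, "A", "20190626-120000", True, ""],
        [122, "B", "20190626-120000", True, ""],
        [123, "C", "20190626-120000", True, ""],
    ],
    columns=columns,
)

df_new = pd.DataFrame(
    [
        [122, "B", "20190627-140000", True, ""],
        [123, "C", "20190627-140000", False, ""],
        [124, "D", "20190627-140000", True, ""],
    ],
    columns=columns,
)
```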

What works:
The merge and update itself works with the following method:

import pandas as pd

def join_data(df_total, df_new):
    # Full outer join on all columns
    df = df_total.merge(df_new, on=list(df_total), how='outer')

    # Drop real duplicates (e.g. ID: 122, Name: B)
    df = df.drop_duplicates(subset=['ID', 'Active'], keep='first')

    # Rows whose ID appears again further down, i.e. entries whose status changed
    changed = df.duplicated(subset=['ID'], keep='last')

    # "Add" the timestamp for the now inactive entry
    df.loc[changed, 'Inactive-Stamp'] = "newTimeStamp"

    # Change `Active` (from `True` to `False`)
    df.loc[changed, 'Active'] = False

    # Delete the new duplicate
    df = df[~df.duplicated(subset=['ID'], keep='first')]

    # Reset the index of the new DataFrame
    return df.reset_index(drop=True)

df_total = join_data(df_total, df_new)

This merges the DataFrames and updates the entries as intended:

    ID Names        Timestamp  Active Inactive-Stamp
0  121     A  20190626-120000    True               
1  122     B  20190626-120000    True               
2  123     C  20190626-120000   False   newTimeStamp
3  124     D  20190627-140000    True

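What goes wrong:
An entry can change back to active = True.
The problem is that some users deactivate their entry only to activate it again right away (so that it shows up at the top of the website again).
The method above therefore also has to work when inactive data becomes active again. Consider the following (even newer) data: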

# df_newer:
    ID Names        Timestamp  Active Inactive-Stamp
0  123     C  20190628-160000    True               
1  125     E  20190628-160000    True

What I tried:
I tried to modify the method by adding the following (for the timestamp and for active):

# Older rows whose ID appears again further down
dup = df.duplicated(subset=['ID'], keep='last')

# Changing the Inactive-Stamp:
df.loc[dup & (df['Active'] == True), 'Inactive-Stamp'] = "Timestamp"
df.loc[dup & (df['Active'] == False), 'Inactive-Stamp'] = ""

# Changing the Active-Boolean
# (the second statement re-reads `Active` after the first one has changed it)
df.loc[dup & (df['Active'] == True), 'Active'] = False
df.loc[dup & (df['Active'] == False), 'Active'] = True

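I want the Inactive-Stamp to be created only when an offer changes to inactive, and to be removed/replaced only in the opposite case.
Obviously, the code above immediately changes every active entry that was just set to inactive back to active.
That is why I thought of running these statements inside an if-condition, something like (pseudocode):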

if (duplicate ID with different active & new entry is inactive):
    df['Active'] = False
    df['Inactive-Stamp'] = newTimeStamp
elif (duplicate ID with different active & new entry is active):
    df['Active'] = True
    df['Inactive-Stamp'] = ''

However, I don't know how to implement such an if-condition on a pandas DataFrame, how to combine it with df.loc, or whether that would even be good practice.
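
One possible way to express the two branches of the pseudocode is with boolean masks instead of a literal if/elif. The following is a minimal sketch, not a tested production solution. The helper name update_total is hypothetical, and the sketch assumes that IDs are unique within each scrape, that a blank Inactive-Stamp is an empty string, and that the new scrape's Timestamp should serve as the inactive stamp.

```python
import pandas as pd

def update_total(df_total, df_new):
    """Hypothetical helper: fold one new scrape into the running table."""
    total = df_total.copy()
    new_by_id = df_new.set_index("ID")  # assumes IDs are unique per scrape

    # Latest status and timestamp for every known ID (NaN if not scraped again)
    new_active = total["ID"].map(new_by_id["Active"])
    new_stamp = total["ID"].map(new_by_id["Timestamp"])

    seen_again = new_active.notna()
    # "if": entry was active and is now inactive
    went_inactive = seen_again & total["Active"] & new_active.eq(False)
    # "elif": entry was inactive and is now active again
    went_active = seen_again & ~total["Active"] & new_active.eq(True)

    total.loc[went_inactive, "Active"] = False
    total.loc[went_inactive, "Inactive-Stamp"] = new_stamp[went_inactive]

    total.loc[went_active, "Active"] = True
    total.loc[went_active, "Inactive-Stamp"] = ""

    # Append IDs that were never seen before; known rows keep their first Timestamp
    unseen = df_new[~df_new["ID"].isin(total["ID"])]
    return pd.concat([total, unseen], ignore_index=True)


cols = ["ID", "Names", "Timestamp", "Active", "Inactive-Stamp"]
df_total = pd.DataFrame(
    [[121, "A", "20190626-120000", True, ""],
     [122, "B", "20190626-120000", True, ""],
     [123, "C", "20190626-120000", True, ""]],
    columns=cols,
)
df_new = pd.DataFrame(
    [[122, "B", "20190627-140000", True, ""],
     [123, "C", "20190627-140000", False, ""],
     [124, "D", "20190627-140000", True, ""]],
    columns=cols,
)
df_newer = pd.DataFrame(
    [[123, "C", "20190628-160000", True, ""],
     [125, "E", "20190628-160000", True, ""]],
    columns=cols,
)

after_new = update_total(df_total, df_new)        # ID 123 becomes inactive
after_newer = update_total(after_new, df_newer)   # ID 123 becomes active again
```

Both masks are computed before any assignment, so the second branch cannot undo what the first one just changed, which is the effect described above for the sequential df.loc statements.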

Merge (and conditionally update) Pandas DataFrames

0 answers: