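I am scraping some data (an automated task that runs hourly) and want to append the newly collected data to the DataFrame I started with, so that this DataFrame (called df_total below) keeps growing.
Every scrape gets a timestamp, but I do not want to overwrite the timestamp when the same data point (identified by a unique ID) is scraped again.
In addition, entries can be marked as active or inactive. If a previously active entry turns inactive in a later scrape, this should be made explicit in the data by adding a new timestamp that shows since when the offer has been inactive.
I want to merge and update the following data (df_total, df_new):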
# df_total:
    ID Names        Timestamp  Active Inactive-Stamp
0  121     A  20190626-120000    True
1  122     B  20190626-120000    True
2  123     C  20190626-120000    True

# df_new:
    ID Names        Timestamp  Active Inactive-Stamp
0  122     B  20190627-140000    True
1  123     C  20190627-140000   False
2  124     D  20190627-140000    True
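
To make the example reproducible, the two tables can be built roughly like this. It is only a sketch: the empty string used for a missing Inactive-Stamp is an assumption, since the tables do not show how the empty cells are actually stored.

```python
import pandas as pd

# Column layout taken from the tables above.
COLUMNS = ["ID", "Names", "Timestamp", "Active", "Inactive-Stamp"]

# "" stands in for "no Inactive-Stamp yet" (assumption, could also be NaN).
df_total = pd.DataFrame(
    [
        [121, "A", "20190626-120000", True, ""],
        [122, "B", "20190626-120000", True, ""],
        [123, "C", "20190626-120000", True, ""],
    ],
    columns=COLUMNS,
)

df_new = pd.DataFrame(
    [
        [122, "B", "20190627-140000", True, ""],
        [123, "C", "20190627-140000", False, ""],
        [124, "D", "20190627-140000", True, ""],
    ],
    columns=COLUMNS,
)
```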
What works:
The merging and updating itself works with the following method:
def join_data(df_total, df_new):
    # Do a full outer join on all columns
    df = df_total.merge(df_new, on=list(df_total), how='outer')
    # Drop real duplicates (e.g. ID: 122, Name: B)
    df = df.drop_duplicates(subset=['ID', 'Active'], keep='first')
    # "Add" the timestamp for the now inactive entry
    df.loc[df.duplicated(subset=['ID'], keep='last'), 'Inactive-Stamp'] = "newTimeStamp"
    # Change `Active` (from `True` to `False`)
    df.loc[df.duplicated(subset=['ID'], keep='last'), 'Active'] = False
    # Delete the new duplicate
    df = df[~df.duplicated(subset=['ID'], keep='first')]
    # Reset the index of the new DataFrame
    df = df.reset_index(drop=True)
    return df

df_total = join_data(df_total, df_new)
This results in a merged DataFrame with the entries updated as planned:
    ID Names        Timestamp  Active Inactive-Stamp
0  121     A  20190626-120000    True
1  122     B  20190626-120000    True
2  123     C  20190626-120000   False   newTimeStamp
3  124     D  20190627-140000    True
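What goes wrong:
Entries can change back to Active = True again.
The issue is that some users simply deactivate their entry and then activate it again (so that it is shown at the top of the website again).
The method above therefore also has to work when inactive data becomes active again. Consider the following (even newer) data: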
# df_newer:
    ID Names        Timestamp  Active Inactive-Stamp
0  123     C  20190628-160000    True
1  125     E  20190628-160000    True
What I tried:
I tried to modify the method by adding the following (for the timestamp and for Active):
# Changing the Inactive-Stamp:
df.loc[(df.duplicated(subset=['ID'], keep='last') == True)
       &
       (df['Active'][df.duplicated(subset=['ID'], keep='last') == True] == True),
       'Inactive-Stamp'] = "Timestamp"
df.loc[(df.duplicated(subset=['ID'], keep='last') == True)
       &
       (df['Active'][df.duplicated(subset=['ID'], keep='last') == True] == False),
       'Inactive-Stamp'] = ""
# Changing the Active boolean:
df.loc[(df.duplicated(subset=['ID'], keep='last') == True)
       &
       (df['Active'][df.duplicated(subset=['ID'], keep='last') == True] == True),
       'Active'] = False
df.loc[(df.duplicated(subset=['ID'], keep='last') == True)
       &
       (df['Active'][df.duplicated(subset=['ID'], keep='last') == True] == False),
       'Active'] = True
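I want to create the Inactive-Stamp only when an offer changes to inactive, and to remove/replace the Inactive-Stamp only in the opposite case.
Apparently, the code above immediately changes every entry that went from active to inactive straight back to active.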
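
As far as I can tell, the likely reason is that each mask is recomputed on the column that the previous statement has already modified. A toy illustration of that effect (the frame and the name demo are made up for this sketch and are not part of my real data):

```python
import pandas as pd

# Toy frame (hypothetical values): one active and one inactive entry.
demo = pd.DataFrame({"ID": [1, 2], "Active": [True, False]})

# Step 1: turn the active rows inactive.
demo.loc[demo["Active"], "Active"] = False

# Step 2: this mask is evaluated on the already modified column, so it also
# matches the row that step 1 has just changed.
demo.loc[~demo["Active"], "Active"] = True

print(demo["Active"].tolist())  # [True, True]
```

Both rows end up active, which matches the flip-back described above.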
Because of this flip-back, I thought of executing my statements inside an if condition, which would look something like this (pseudocode):
if (duplicate ID with different active & new entry is inactive):
    df['Active'] = False
    df['Inactive-Stamp'] = newTimeStamp
elif (duplicate ID with different active & new entry is active):
    df['Active'] = True
    df['Inactive-Stamp'] = ''
However, I know neither how to implement that if condition on a pandas DataFrame, nor how to combine it with df.loc, nor whether this would be good practice at all.
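
For illustration, here is a rough sketch of how the if/elif idea might be expressed with boolean masks keyed on ID instead of an outer merge. The function name update_total is hypothetical, and the sketch assumes that IDs are unique within each frame, that an empty Inactive-Stamp is stored as "", and that the Timestamp of the new scrape is used as the Inactive-Stamp. Both masks are computed before anything is assigned, so the second assignment cannot undo the first.

```python
import pandas as pd

COLUMNS = ["ID", "Names", "Timestamp", "Active", "Inactive-Stamp"]


def update_total(df_total, df_new):
    """Fold a new scrape into the running table, keyed on ID (sketch)."""
    total = df_total.set_index("ID").copy()
    new = df_new.set_index("ID")

    # IDs that appear in both the running table and the new scrape.
    common = total.index.intersection(new.index)
    was_active = total.loc[common, "Active"].astype(bool)
    is_active = new.loc[common, "Active"].astype(bool)

    # Compute both masks up front, before any assignment happens.
    deactivated = common[(was_active & ~is_active).to_numpy()]
    reactivated = common[(~was_active & is_active).to_numpy()]

    # "if" branch: the entry was active and is now inactive.
    total.loc[deactivated, "Active"] = False
    total.loc[deactivated, "Inactive-Stamp"] = new.loc[deactivated, "Timestamp"]

    # "elif" branch: the entry was inactive and is active again.
    total.loc[reactivated, "Active"] = True
    total.loc[reactivated, "Inactive-Stamp"] = ""

    # Append IDs that were never seen before; known IDs keep their Timestamp.
    unseen = new.loc[new.index.difference(total.index)]
    return pd.concat([total, unseen]).reset_index()


# Usage with the data from the question ("" = no Inactive-Stamp, an assumption).
df_total = pd.DataFrame(
    [[121, "A", "20190626-120000", True, ""],
     [122, "B", "20190626-120000", True, ""],
     [123, "C", "20190626-120000", True, ""]],
    columns=COLUMNS,
)
df_new = pd.DataFrame(
    [[122, "B", "20190627-140000", True, ""],
     [123, "C", "20190627-140000", False, ""],
     [124, "D", "20190627-140000", True, ""]],
    columns=COLUMNS,
)
df_newer = pd.DataFrame(
    [[123, "C", "20190628-160000", True, ""],
     [125, "E", "20190628-160000", True, ""]],
    columns=COLUMNS,
)

step1 = update_total(df_total, df_new)   # ID 123 becomes inactive
step2 = update_total(step1, df_newer)    # ID 123 becomes active again
print(step2)
```

Whether keying on ID like this is preferable to the merge-and-deduplicate approach is a judgment call; it is shown here only to make the intended behaviour concrete.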