I have two DataFrames, structured as follows:
df1 = pd.read_csv("Main_Database.csv")
# df1 Columns: ..., Timestamp, Name, Query, Website, Status,...
df2 = pd.read_csv("New_Raw_Results.csv")
# df2 Columns: ..., Timestamp, Name, Query, Website, Status,...
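Both DataFrames can have exactly the same columns.
Main_Database.csv keeps track of all records, and New_Raw_Results.csv is the list of new results that comes in every week. I want to handle changes to the main database according to the following three cases:
A) If the Query AND Website from df2 are found in df1,
-> write the Timestamp from df2 into the df1 column "Last Seen"
-> overwrite Status with "STILL ACTIVE"
B) If the Query and Website from df2 are not found in df1,
-> append the whole df2 row to df1
-> overwrite Status with "NET NEW"
C) If the Query and Website from df1 are not found in df2,
-> overwrite Status with "EXPIRED"
I have tried combinations of merge and join, but I am stuck. For example, if I isolate the result of an inner join between the two tables in a new DataFrame, I am not sure how to use it to act on my main database. I am trying to put all of these conditions into one function that I can reuse to process new entries.
How would you structure this function? What is the cleanest way to solve this?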
Answer 0 (score: 0)
Dataset
import pandas as pd
from numpy.random import default_rng
rng = default_rng()
columns = ['query','website','timestamp','status','last_seen']
data = rng.integers(1,20,(100,5))
df1 = pd.DataFrame(data=data, columns=columns,dtype=str)
data = rng.integers(1,20,(100,5))
df2 = pd.DataFrame(data=data, columns=columns,dtype=str)
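Concatenating the query and website columns makes the comparison easier. For example
Query Website
0 query1 website1 --> 'query1website1'
Create a Series of the concatenated columns for each DataFrame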
a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)
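One caveat with plain concatenation: two different (query, website) pairs can collapse into the same key, which matters for the random numeric strings used in this dataset. The sketch below illustrates the problem on made-up values and shows the `sep` argument of `Series.str.cat` as one possible guard; the choice of "|" as separator is an assumption and only helps if that character never appears in the data.

```python
import pandas as pd

# Hypothetical rows chosen to collide: '1' + '12' and '11' + '2' are both '112'.
df = pd.DataFrame({"query": ["1", "11"], "website": ["12", "2"]})

# Plain concatenation maps two different pairs to the same key.
plain = df["query"].str.cat(df["website"])

# A separator that cannot occur in the data keeps the pairs distinct.
safe = df["query"].str.cat(df["website"], sep="|")

print(plain.tolist())  # ['112', '112']
print(safe.tolist())   # ['1|12', '11|2']
```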
Create a Boolean Series for each of your three conditions.
cond1 = a.isin(b) # ended up not using this
cond2 = ~cond1
cond3 = ~b.isin(a)
Set the status based on condition 3 - your C)
df1.loc[cond3,'status'] = 'EXPIRED'
Update with the new information - your A)
Use numpy broadcasting to compare all df2 values (a) with all df1 values (b) and get the indices where they match.
indices1 = (a.values[:,None] == b.values).argmax(1)
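(a.values[:,None] == b.values) produces a 2-D Boolean array that compares every a value with every b value. The argmax function returns, for each a value, the index of the first matching b value. It also returns 0 when there is no match at all, so the rows that really matched have to be identified separately.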
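To make the broadcasting step concrete, here is a small sketch on hand-written keys (the key strings are made up for illustration). It shows why `argmax` alone is ambiguous: a row with no match and a row that matches position 0 both yield 0, so a separate `any` mask is used to tell them apart.

```python
import numpy as np

a = np.array(["q1w1", "q9w9", "q2w2"])  # stand-in for the df2 keys
b = np.array(["q1w1", "q2w2", "q3w3"])  # stand-in for the df1 keys

# Shape (3, 3): row i compares a[i] with every value in b.
matches = a[:, None] == b

first = matches.argmax(axis=1)  # position of the first True in each row
found = matches.any(axis=1)     # whether the row contains a True at all

print(first)         # [0 0 1]
print(found)         # [ True False  True]
print(first[found])  # [0 1]
```

Here "q9w9" has no counterpart in b, yet its argmax is 0, exactly like "q1w1", which genuinely matches position 0.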
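# argmax returns 0 both for "no match" and for "match at df1 row 0",
# so build an explicit mask of the df2 rows that have at least one match
has_match = (a.values[:,None] == b.values).any(axis=1)
# df1 row indices where df1.qw == df2.qw
x = indices1[has_match]
# df2 rows where df2.qw == df1.qw
y = df2.loc[has_match]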
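x is an array of df1 integer indices that have matches in df2. y is the DataFrame of matches corresponding to x (a subset of df2). Use the integer array to assign the new values to the correct df1 rows.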
df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"
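Note: if df1 has several rows with the same qw value, np.argmax finds only the first one, and the columns of the others stay unchanged. With random data this happens regularly.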
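If duplicate keys in df1 should all be updated, one possible alternative to argmax is a key-to-timestamp lookup applied with `Series.map`. This is only a sketch on made-up data: the single `key` column stands in for the concatenated query/website Series, and keeping the last timestamp when df2 repeats a key is an assumption about the desired behaviour.

```python
import pandas as pd

# Hypothetical frames; "key" plays the role of the concatenated qw value.
df1 = pd.DataFrame({"key": ["a", "a", "b"],
                    "last_seen": ["-", "-", "-"],
                    "status": ["old", "old", "old"]})
df2 = pd.DataFrame({"key": ["a", "c"], "timestamp": ["t1", "t2"]})

# One timestamp per key; if df2 repeats a key, the last row wins (assumption).
lookup = df2.drop_duplicates("key", keep="last").set_index("key")["timestamp"]

# Every df1 row whose key occurs in df2 is updated, duplicates included.
hit = df1["key"].isin(lookup.index)
df1.loc[hit, "last_seen"] = df1.loc[hit, "key"].map(lookup)
df1.loc[hit, "status"] = "STILL ACTIVE"

print(df1)
```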
Add the new rows - your B)
df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)
Putting it all together...
a = df2['query'].str.cat(df2.website)
b = df1['query'].str.cat(df1.website)
cond1 = a.isin(b) # ended up not using this
cond2 = ~cond1
cond3 = ~b.isin(a)
df1.loc[cond3,'status'] = 'EXPIRED'
indices1 = (a.values[:,None] == b.values).argmax(1)
has_match = (a.values[:,None] == b.values).any(axis=1)  # argmax alone cannot tell "no match" from "match at row 0"
x = indices1[has_match]
y = df2.loc[has_match]
df1.loc[x,'last_seen'] = y.timestamp.values
df1.loc[x,'status'] = "STILL ACTIVE"
df2.loc[cond2,'status'] = "NET NEW"
df1 = pd.concat([df1,df2.loc[cond2]], ignore_index=True)
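Since the question asks for a single reusable function, the steps can be wrapped roughly as follows. This is a sketch under assumptions, not the code from this answer: the name `update_main`, the `KEYS` constant and the lowercase column names are made up for illustration, and it compares `(query, website)` pairs through a `MultiIndex` instead of concatenated strings. When the new data repeats a pair, the last timestamp wins.

```python
import pandas as pd

KEYS = ["query", "website"]  # columns that identify a record (assumed names)

def update_main(main, new):
    """Return an updated copy of `main` after applying cases A), B) and C)."""
    main, new = main.copy(), new.copy()
    main_keys = pd.MultiIndex.from_frame(main[KEYS])
    new_keys = pd.MultiIndex.from_frame(new[KEYS])
    in_new = main_keys.isin(new_keys)   # main rows that reappear in new
    in_main = new_keys.isin(main_keys)  # new rows already known to main

    # C) main rows that are missing from the new results
    main.loc[~in_new, "status"] = "EXPIRED"

    # A) main rows seen again: copy the new timestamp into last_seen
    stamp = dict(zip(new_keys, new["timestamp"]))
    main.loc[in_new, "last_seen"] = [stamp[k] for k in main_keys[in_new]]
    main.loc[in_new, "status"] = "STILL ACTIVE"

    # B) rows that only exist in the new results
    fresh = new.loc[~in_main].assign(status="NET NEW")
    return pd.concat([main, fresh], ignore_index=True)

main = pd.DataFrame({"query": ["q1", "q2"], "website": ["w1", "w2"],
                     "timestamp": ["t0", "t0"], "status": ["", ""],
                     "last_seen": ["t0", "t0"]})
new = pd.DataFrame({"query": ["q1", "q3"], "website": ["w1", "w3"],
                    "timestamp": ["t5", "t5"], "status": ["", ""],
                    "last_seen": ["", ""]})

result = update_main(main, new)
print(result[["query", "website", "status", "last_seen"]])
```

The function returns a new frame instead of modifying its inputs, so the weekly run could be written as `df1 = update_main(df1, df2)`.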
Answer 1 (score: 0)
This should do the job:
import pandas as pd
data = [
{"timestamp": 1, "last_seen": 1, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 2, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 3, "status": "XXX", "website": "website3", "query": "query1"},
{"timestamp": 1, "last_seen": 4, "status": "XXX", "website": "website5", "query": "query1"},
{"timestamp": 1, "last_seen": 5, "status": "XXX", "website": "website6", "query": "query1"}
]
new_data = [
{"timestamp": 1, "last_seen": 6, "status": "XXX", "website": "website1", "query": "query1"},
{"timestamp": 1, "last_seen": 7, "status": "XXX", "website": "website2", "query": "query2"},
{"timestamp": 1, "last_seen": 8, "status": "XXX", "website": "website3", "query": "query4"},
{"timestamp": 1, "last_seen": 9, "status": "XXX", "website": "website3", "query": "query8"}
]
df = pd.DataFrame(data)
df_new = pd.DataFrame(new_data)
for i, row in df.iterrows():
    tmp = df_new.loc[(df_new['website'] == row['website']) & (df_new['query'] == row['query'])]
    if not tmp.empty:
        # A) found in both: take the value from the first matching new row
        df.at[i, 'last_seen'] = tmp['last_seen'].iloc[0]
        df.at[i, 'status'] = "STILL ACTIVE"
    else:
        # C) no longer present in the new data
        df.at[i, 'status'] = "EXPIRED"
for i, row in df_new.iterrows():
    # B) present only in the new data
    tmp = df.loc[(df['website'] == row['website']) & (df['query'] == row['query'])]
    if tmp.empty:
        row["status"] = "NET NEW"
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
print(df)
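For larger tables the row-by-row loops can become slow. As a hedged alternative for the classification step only, an outer merge with `indicator=True` labels every (query, website) pair in one pass. The sketch below uses a cut-down version of the sample data, and mapping the `_merge` values onto the three statuses is an illustration, not part of the answer above.

```python
import pandas as pd

keys = ["query", "website"]
df = pd.DataFrame({"query": ["query1", "query2"],
                   "website": ["website1", "website2"]})
df_new = pd.DataFrame({"query": ["query1", "query4"],
                       "website": ["website1", "website3"]})

# indicator=True adds a "_merge" column: 'both', 'left_only' or 'right_only'.
merged = df[keys].merge(df_new[keys], on=keys, how="outer", indicator=True)

# Translate the merge indicator into the three statuses from the question.
merged["status"] = merged["_merge"].astype(str).map({
    "both": "STILL ACTIVE",
    "left_only": "EXPIRED",
    "right_only": "NET NEW",
})

print(merged[keys + ["status"]])
```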