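I have two DataFrames, call them A and B, that share the same index (person ID), but some IDs may be in A and not in B, and vice versa. Also, IDs are non-unique in B, whereas they are unique in A. So I want to check B to see whether a given ID exists there, and if it does, add the maximum value of B's label column for that ID to A.
I tried writing the function below to pass as the argument to pandas' .apply():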
def add_labels_to_dataframe(train_df,
                            id_col_name='person_id',
                            label_name="max_progress",
                            label_filepath=LABELS_SRC_FILE,
                            default_value=-1,
                            save=True):
    """
    Add labels column to train_df
    :param train_df: (DataFrame)
        the training dataframe that needs labels
    :param id_col_name: (str)
        name of the ID column to use
    :param label_name: (str)
        the column name of the label to use (score/progress/is_X/etc)
    :param label_filepath: (str)
        filepath with IDs and associated labels
    :param default_value: (int, or anything)
        The default label to give when a person_id has no associated label
    :return: (DataFrame)
        updated dataframe with labels
    """
    labels_df = pd.read_csv(label_filepath)

    def get_max_score(row):
        """
        DataFrame function to select max score when multiple exist per ID
        :param row: (DataFrame)
            A single row of the dataframe being modified
        :return: (int)
            returns elements of a Series that becomes a new column of the DataFrame
        """
        # if person_id is in labels, then get max of labels
        pdb.set_trace()
        pid_labels_df = labels_df[row[id_col_name].isin(labels_df[id_col_name])]
        if not pid_labels_df.empty and not pd.isnull(pid_labels_df[label_name].max()):
            return 1 + pid_labels_df[label_name].max()
        return default_value

    train_df[label_name] = train_df.apply(get_max_score, axis=1)
    if save:
        train_df.to_csv(LABELED_TRAIN_DF_PATH)
    return train_df
ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index 0')
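I know I could convert both DataFrame indexes to Python lists, check whether each value exists, and then build a new DataFrame that maps the old rows to the labeled value or to a default of -1, but I am trying to do everything in pandas so that I can take advantage of vectorization.
Can someone help me find a simple way to do this using only DataFrame operations, without casting to Python lists?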
Answer 0 (score: 0)
I think* you can do this with groupby and transform:
df[label_name] = df.groupby("person_id")[label_name].transform("max")
*It is a little hard to tell exactly what your code is trying to do...
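If the labels live in a separate frame B, as the question describes, one possible approach is to reduce B to a single maximum per ID with groupby and then look those values up from A with Series.map. The following is only a sketch under some assumptions: the ID is an ordinary column named person_id (as in the question's code, rather than the index), the helper name add_max_label and the sample frames are made up for illustration, and the "1 + max" and -1 default are copied from the question's function.

```python
import pandas as pd


def add_max_label(a_df, b_df, id_col="person_id",
                  label_col="max_progress", default_value=-1):
    """Hypothetical helper: attach the per-ID maximum of b_df[label_col] to a_df."""
    # Collapse B to one value per ID. max() skips NaN, and a group whose
    # labels are all NaN yields NaN.
    max_by_id = b_df.groupby(id_col)[label_col].max()

    out = a_df.copy()  # leave the caller's frame untouched
    # Series.map looks each ID up in max_by_id's index; IDs that are
    # missing from B come back as NaN.
    matched = out[id_col].map(max_by_id)
    # Mirror the question's logic: 1 + max when a label exists, else the default.
    out[label_col] = (matched + 1).fillna(default_value)
    return out


# Made-up sample data: ID 2 is absent from B, ID 3 has only a missing label,
# and ID 4 appears in B but not in A.
a = pd.DataFrame({"person_id": [1, 2, 3]})
b = pd.DataFrame({"person_id": [1, 1, 3, 4],
                  "max_progress": [2, 5, None, 7]})

labeled = add_max_label(a, b)
print(labeled)
```

Because the lookup goes through a Series indexed by ID, no row-wise apply or conversion to Python lists is needed. Note that the new column ends up as float, since NaN is used for the intermediate missing values.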