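I am familiar with pandas cut(), and I am looking for an efficient way to do the same thing in two dimensions.
Background: I have two dataframes. df_data consists of X and Y coordinates, while df_box consists of the lower-left X, lower-left Y, upper-right X and upper-right Y of the regions of interest, along with their respective names. I want to label the coordinates in df_data according to the region they fall in.
Assumptions: (1) For a first pass, it is fine to assume that the regions in df_box are mutually exclusive.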
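To make the setup concrete, here is a minimal sketch of the two frames and of the labelling I am after, computed by brute force with NumPy broadcasting. The column names (`x`, `y`, `llx`, `lly`, `urx`, `ury`, `name`), the sample values and the helper `label_bruteforce` are all made up for illustration, and boxes are assumed to be half-open (`ll <= p < ur`). It builds a points-by-boxes boolean matrix, so it is only a reference for small inputs, not something that scales to millions of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df_data = pd.DataFrame({"x": [0.5, 1.5, 3.0, 9.0],
                        "y": [0.5, 0.5, 2.5, 9.0]})
df_box = pd.DataFrame({"llx": [0.0, 1.0, 2.0],
                       "lly": [0.0, 0.0, 2.0],
                       "urx": [1.0, 2.0, 5.0],
                       "ury": [1.0, 1.0, 3.0],
                       "name": ["A", "B", "C"]})

def label_bruteforce(df_in, df_bbox):
    """Reference labelling: test every point against every box."""
    # Column vectors of points broadcast against row vectors of box edges.
    x = df_in["x"].to_numpy(dtype=float)[:, None]
    y = df_in["y"].to_numpy(dtype=float)[:, None]
    inside = ((x >= df_bbox["llx"].to_numpy()) & (x < df_bbox["urx"].to_numpy())
              & (y >= df_bbox["lly"].to_numpy()) & (y < df_bbox["ury"].to_numpy()))
    hit = inside.any(axis=1)        # does the point fall in any box?
    first = inside.argmax(axis=1)   # index of the first matching box
    names = df_bbox["name"].to_numpy(dtype=object)
    labels = np.where(hit, names[first], None)  # None where nothing matches
    return pd.Series(labels, index=df_in.index, name="name")

labels = label_bruteforce(df_data, df_box)
```

The first three sample points fall in boxes A, B and C respectively, and the last one falls outside every box and stays unlabelled.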
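I wrote the following code, which is very fast (speed matters a lot, since I am dealing with millions of rows in each dataframe). However, it has one drawback: it does not work when the regions are unevenly spaced. See illustration of what works and what doesn't here.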
import pandas as pd

def label_by_bbox2(df_in, col_x, col_y, df_bbox, col_llx, col_lly, col_urx, col_ury, col_label, dropNoMatch=False):
    ''' assumes mutually exclusive bounding boxes, and there's only one pair of lattice constant.
    Labels are treated as strings. '''
    df_bbox_now = df_bbox[[col_llx, col_lly, col_urx, col_ury, col_label]]
    df_bbox_now = df_bbox_now.drop_duplicates(subset=[col_llx, col_lly, col_urx, col_ury])
    # NOTE: because of our assumption of mutually exclusive bbox, and only one pair of lattice constant,
    # it is enough to evaluate llx and lly
    # --- llx ---
    bin_now = df_bbox.sort_values(col_llx)[col_llx].drop_duplicates()
    binlabel_now = bin_now[ : len(bin_now)-1 ] # drop last element
    df_in[col_x] = df_in[col_x].astype(float)
    # label each x with the left edge of the bin it falls into
    df_in['llx'] = pd.cut( df_in[col_x], bin_now, labels=binlabel_now).astype(float)
    #df_in['llx2'] = pd.cut( df_in[col_x], bin_now)
    # --- lly ---
    bin_now = df_bbox.sort_values(col_lly)[col_lly].drop_duplicates()
    binlabel_now = bin_now[ : len(bin_now)-1 ] # drop last element
    df_in[col_y] = df_in[col_y].astype(float)
    # label each y with the lower edge of the bin it falls into
    df_in['lly'] = pd.cut( df_in[col_y], bin_now, labels=binlabel_now).astype(float)
    # --- update the table with labels ---
    # index on the helper columns created above; set_index([col_llx, col_lly])
    # only worked when those names happened to be 'llx' and 'lly'
    df_in = df_in.set_index(['llx', 'lly'])
    df_bbox_now = df_bbox_now.set_index([col_llx, col_lly])
    df_in[col_label] = df_bbox_now[col_label]
    df_in = df_in.reset_index(drop=True)
    df_in[col_label] = df_in[col_label].astype(str)
    # --- delete entries that have no match ---
    if dropNoMatch:
        df_in = df_in[ df_in[col_label]!='nan' ]
        df_in = df_in.reset_index(drop=True)
    return df_in
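In summary, I am looking for some kind of cut() that works on a multi-index. Does anyone have an idea for handling unevenly spaced regions without significantly sacrificing computation speed? I don't mind using numpy, for example.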
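One direction that might work, sketched here under several assumptions rather than offered as a tested solution: treat every box edge as a grid line, so that each cell of the resulting non-uniform grid lies entirely inside one box or outside all of them, then locate each point's cell with `np.searchsorted`. The function name `label_by_grid` and its argument names are hypothetical, boxes are assumed to be axis-aligned, mutually exclusive and half-open (`ll <= p < ur`), and at least one box is assumed to exist. The lookup table has one entry per grid cell, so it may become too large when there are very many distinct edges.

```python
import numpy as np

def label_by_grid(x, y, llx, lly, urx, ury, labels):
    """Label points by mutually exclusive, axis-aligned, half-open boxes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    llx, lly, urx, ury = (np.asarray(a, dtype=float) for a in (llx, lly, urx, ury))
    labels = np.asarray(labels, dtype=object)

    # Every box edge becomes a grid line of a non-uniform grid.
    xe = np.unique(np.concatenate([llx, urx]))
    ye = np.unique(np.concatenate([lly, ury]))

    # Lookup table: grid cell -> box index, -1 meaning "no box".
    table = np.full((len(xe) - 1, len(ye) - 1), -1, dtype=np.int64)
    ix0, ix1 = np.searchsorted(xe, llx), np.searchsorted(xe, urx)
    iy0, iy1 = np.searchsorted(ye, lly), np.searchsorted(ye, ury)
    for k in range(len(labels)):  # loops over boxes, not over points
        table[ix0[k]:ix1[k], iy0[k]:iy1[k]] = k

    # The 2D "cut": find the cell of each point along both axes.
    ix = np.searchsorted(xe, x, side="right") - 1
    iy = np.searchsorted(ye, y, side="right") - 1
    ok = (ix >= 0) & (ix < len(xe) - 1) & (iy >= 0) & (iy < len(ye) - 1)
    box = np.full(x.shape, -1, dtype=np.int64)
    box[ok] = table[ix[ok], iy[ok]]

    # Translate box indices to labels; unmatched points stay None.
    out = np.full(x.shape, None, dtype=object)
    out[box >= 0] = labels[box[box >= 0]]
    return out
```

With frames shaped like the ones described above, this could be called as `label_by_grid(df_data[col_x], df_data[col_y], df_box[col_llx], df_box[col_lly], df_box[col_urx], df_box[col_ury], df_box[col_label])`. The per-point work is two binary searches and one table lookup, all vectorized.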
Thanks.