Question

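I have a pandas DataFrame of points that are made up of measures (simplifying the problem here: a point is a latitude/longitude position on a planet, and a measure is an instance of that point on an image, i.e. a pixel x/y value). I have one column for point_id and one column for the serial number (of the image). I also have a column called referenceIndex. The reference index identifies the image, within a given point, that serves as the reference against which the other images are compared. Here is an example of a possible setup, with simplified column names and cell values: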

   PtID  SN  RI
0   A    a   0
1   A    b   0
2   A    c   0
3   B    b   1
4   B    d   1
5   C    a   1
6   C    e   1

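In the example I have 3 points. Point A exists on 3 images (a, b, c), and its reference image is a. Point B exists on 2 images, and its reference image is d (because the reference index = 1). Point C exists on 2 images, and its reference image is e.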
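To make the reference-index convention concrete, here is a minimal sketch that rebuilds the example table and looks up each point's reference image. It assumes RI is a zero-based position within the point's rows, taken in table order, which is how the example reads. The helper column `pos` and the name `ref_by_point` are only for illustration.

```python
import pandas as pd

# The example table from above, with its simplified column names.
df = pd.DataFrame({
    "PtID": ["A", "A", "A", "B", "B", "C", "C"],
    "SN":   ["a", "b", "c", "b", "d", "a", "e"],
    "RI":   [0, 0, 0, 1, 1, 1, 1],
})

# Position of each row within its point (0, 1, 2, ...), in table order.
df["pos"] = df.groupby("PtID").cumcount()

# Assumption: a row is its point's reference when its position equals RI.
reference = df.loc[df["pos"] == df["RI"], ["PtID", "SN"]]
ref_by_point = dict(zip(reference["PtID"], reference["SN"]))
print(ref_by_point)  # {'A': 'a', 'B': 'd', 'C': 'e'}
```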

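The input is a CSV of point_id, serial number pairs to extract. The list may not include a point's reference measure, so I can't modify it. And when the reference is missing from the list, I need some way to preserve it.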

Example Input:  {A,a}
Example Output: row: {0}

Example Input:  {C,a}
Example Output: row: {5,6}

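The first example is simple. The second is harder: my input list only asks for row 5, but to preserve the reference measure for point_id C, I must also extract row 6.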

Help?

P.S. Here is the code I came up with. The else: branch is many times slower than my original approach (which did not meet my requirements).

if args.flagKEEPREF == "false":
    newDF_idx =[]
    for measure in arr_measures:
        test = (df.loc[(df.point_id == measure[0]) & (df.serialnumber == measure[1])]).index
        if len(test) > 0:           #account for case where the measure does not exist in the network
            newDF_idx.append(test[0])
    df = df.loc[sorted(set(newDF_idx))]  # newDF_idx holds index labels, so select with .loc
    df = df.sort_values(by=['point_id', 'serialnumber'])  # sort_values returns a new frame
    df = df.reset_index(drop=True)
else:
    gb = [x for _, x in df.groupby('point_id')] #create groups based on point_id
    newDF_idx = []
    for group in gb:
        referenceIndex = group.iloc[0]['referenceIndex']
        for measure in arr_measures:
            if measure[0] in group['point_id'].values:  #optimization: can pre-screen
                test = (group.loc[(group.point_id == measure[0]) & (group.serialnumber == measure[1])]).index
                if len(test) > 0:                       #account for case where the measure does not exist in the network
                    newDF_idx.append(test[0])
                    newDF_idx.append(group.index[referenceIndex]) #use the index within the group to get the index in the original array
    df = df.loc[sorted(set(newDF_idx))]  # newDF_idx holds index labels, so select with .loc
    df = df.sort_values(by=['point_id', 'serialnumber'])  # sort_values returns a new frame
    df = df.reset_index(drop=True)
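
One possible vectorized alternative to the nested loops above, offered as a sketch and not benchmarked against the real data. It uses the simplified column names from the example table (PtID, SN, RI) and assumes RI is a zero-based position within each point's rows, in table order. The function name `extract_measures` and its `keep_ref` parameter are hypothetical.

```python
import pandas as pd

def extract_measures(df, measures, keep_ref=True):
    """Return rows matching (PtID, SN) pairs, optionally adding each hit point's reference row."""
    wanted = pd.DataFrame(list(measures), columns=["PtID", "SN"]).drop_duplicates()
    # Left merge keeps one output row per row of df, in df order;
    # the indicator column marks which rows were requested.
    merged = df[["PtID", "SN"]].merge(wanted, on=["PtID", "SN"], how="left", indicator=True)
    hit = pd.Series((merged["_merge"] == "both").to_numpy(), index=df.index)
    if keep_ref:
        # Assumption: RI is the zero-based position of the reference row within its point.
        pos = df.groupby("PtID").cumcount()
        hit_points = df.loc[hit, "PtID"].unique()
        is_ref = (pos == df["RI"]) & df["PtID"].isin(hit_points)
        hit = hit | is_ref
    return df[hit]

df = pd.DataFrame({
    "PtID": ["A", "A", "A", "B", "B", "C", "C"],
    "SN":   ["a", "b", "c", "b", "d", "a", "e"],
    "RI":   [0, 0, 0, 1, 1, 1, 1],
})

print(list(extract_measures(df, [("A", "a")]).index))  # [0]
print(list(extract_measures(df, [("C", "a")]).index))  # [5, 6]
```

Requested pairs that do not exist in the DataFrame simply match nothing, so no separate existence check is needed. Passing `keep_ref=False` returns only the requested rows.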

How can I quickly extract rows from a Pandas DataFrame based on column values that reference a column index?

0 answers: