How to optimize iterating over a pandas DataFrame

Time: 2020-05-07 21:03:43

Tags: python pandas optimization modin

I have the following method, in which I eliminate overlapping intervals in a dataframe based on a set of hierarchical rules:

import numpy as np
import pandas as pd

def disambiguate(arg):

    arg['length'] = (arg.end - arg.begin).abs()

    df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()

    data = []
    out = pd.DataFrame()
    for row in df.itertuples():

        test = df[df['note_id']==row.note_id].copy()

        # get overlapping intervals: 
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric), test.end.apply(pd.to_numeric), closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()

        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))

        # filter out overlapping rows via hierarchy 
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]

        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]

        data.append(fx)

    out = pd.concat(data, axis=0)

    # randomly reindex to keep random row when dropping remaining duplicates: https://gist.github.com/cadrev/6b91985a1660f26c2742
    out.reset_index(inplace=True)
    out = out.reindex(np.random.permutation(out.index))

    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])

This works fine, except for the fact that each of the dataframes I loop over has more than 100k rows, so it takes forever to complete. I timed the various methods with %prun in Jupyter, and the call that seems to consume the processing time is series.py:3719(apply). Note: I tried using modin.pandas, but that caused even more problems (I kept getting an error that Interval needs a left value less than right, which I couldn't work out; I may file a GitHub issue about it there).
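For what it's worth, the Interval complaint can be reproduced in plain pandas too: pd.Interval raises a ValueError when left > right. A minimal sketch of guarding against that (the begin/end values here are made up, not from my pipeline):

```python
import pandas as pd

# pd.Interval raises ValueError when left > right, e.g. pd.Interval(42, 38).
# Normalizing the endpoints first avoids the error:
begin, end = 42, 38

left, right = min(begin, end), max(begin, end)
span_range = pd.Interval(left, right)
print(span_range)
```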

I'm looking for a way to optimize this, for example with vectorization, but honestly I don't have the slightest idea how to translate it into a vectorized form.
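To make the question concrete, the overlap test itself can be expressed pairwise with NumPy broadcasting instead of a per-row loop (just a sketch on toy intervals, not a drop-in replacement for my method): two open intervals (b1, e1) and (b2, e2) overlap iff b1 < e2 and b2 < e1.

```python
import numpy as np
import pandas as pd

# Toy intervals from one note_id group
df = pd.DataFrame({'begin': [0, 10, 25, 28], 'end': [9, 14, 37, 37]})

b = df['begin'].to_numpy()
e = df['end'].to_numpy()

# overlaps[i, j] is True when open interval i overlaps open interval j;
# the whole pairwise matrix is computed in one broadcasted expression.
overlaps = (b[:, None] < e[None, :]) & (b[None, :] < e[:, None])
print(overlaps)
```

The diagonal is True (every interval overlaps itself), matching the behavior of testing each row against an IntervalIndex built from the same group.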

Here is a sample of my data:

begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222
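A side note on the sample: note_id values like 0365 carry a leading zero, so if you load this CSV with pandas' default dtype inference, note_id becomes the integer 365. A sketch of reading it as a string instead (inlined here via io.StringIO purely for illustration):

```python
import io
import pandas as pd

csv = """begin,end,note_id,score
0,9,0365,1
10,14,0365,1
"""

# Force note_id to str so the leading zero in "0365" survives;
# the default inference would parse it as the integer 365.
df = pd.read_csv(io.StringIO(csv), dtype={'note_id': str})
print(df['note_id'].iloc[0])
```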

1 answer:

Answer 0: (score: 0)

I think I know where the problem was: my filtering on note_id was incorrect, so I was iterating over the entire dataframe.

It should be:

    data = []
    cases = set(df['note_id'].tolist())

    for case in cases:
        test = df[df['note_id']==case].copy()
        for row in test.itertuples():

            # get overlapping intervals: 
            # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
            iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
            span_range = pd.Interval(row.begin, row.end)
            fx = test[iix.overlaps(span_range)].copy()

            maxLength = fx['length'].max()
            minLength = fx['length'].min()
            maxScore = abs(float(fx['score'].max()))
            minScore = abs(float(fx['score'].min()))

            if maxScore > minScore:
                fx = fx[fx['score'] == maxScore]

            elif maxLength > minLength:
                fx = fx[fx['length'] == maxLength]

            data.append(fx)

    out = pd.concat(data, axis=0)

Testing on a single note, it took more than 16 minutes before I stopped iterating over the entire unfiltered dataframe. Now it takes 28 seconds!
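The same hoisting can also be written with groupby, which builds the IntervalIndex once per note instead of once per row (a sketch on toy data with the same hierarchy rules, not the exact production code):

```python
import pandas as pd

# Toy data: three overlapping-candidate rows for note 0365, one for 0400
df = pd.DataFrame({
    'begin':   [0, 25, 28, 0],
    'end':     [9, 37, 37, 5],
    'note_id': ['0365', '0365', '0365', '0400'],
    'score':   [1.0, 0.7, 1.0, 1.0],
})
df['length'] = (df['end'] - df['begin']).abs()

data = []
for note_id, test in df.groupby('note_id'):
    # One IntervalIndex per note group, hoisted out of the row loop
    iix = pd.IntervalIndex.from_arrays(test['begin'], test['end'], closed='neither')
    for row in test.itertuples():
        fx = test[iix.overlaps(pd.Interval(row.begin, row.end))].copy()
        # Hierarchy: prefer higher score, then longer span
        if fx['score'].max() > fx['score'].min():
            fx = fx[fx['score'] == fx['score'].max()]
        elif fx['length'].max() > fx['length'].min():
            fx = fx[fx['length'] == fx['length'].max()]
        data.append(fx)

out = pd.concat(data).drop_duplicates(subset=['begin', 'end', 'note_id'])
print(out)
```

On this toy input, the lower-scoring interval (25, 37) is dropped in favor of (28, 37), while the non-overlapping rows pass through untouched.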