Double iterrows() loop is too slow in my case

Date: 2019-02-22 10:41:14

Tags: python pandas

My goal is to normalize an "in" file using a "mock" file. The way it has to work is: if an entry in the mock file belongs to the same group, and its position lies in the interval between position_start and position_end, then the mock score must be subtracted from data_values.

Below I give a simplified case; the real tables are much bigger and my solution is not fast enough. I have been looking for alternatives, but so far nothing seems to solve my problem. I am sure there is a faster way to do this, and hope someone can help me.

The code I wrote does exactly what I want:

import pandas as pd

test_in_dict = {'group': [1, 1, 1, 2, 2, 2],
                'position_start': [10, 20, 30, 40, 50, 60],
                'position_end': [15, 25, 35, 45, 55, 65],
                'data_values': [11, 12, 13, 14, 15, 16]}
test_in = pd.DataFrame(data=test_in_dict)

test_mock_dict = {'group_m': [1, 1, 1, 1, 2, 2, 2, 2],
                  'position_m': [11, 16, 20, 52, 42, 47, 12, 65],
                  'score_m': [1, 1, 2, 1, 3, 1, 2, 1]}
test_mock = pd.DataFrame(data=test_mock_dict)

for index_in, row_in in test_in.iterrows():
    for index_m, row_m in test_mock.iterrows():
        if (row_in['group'] == row_m['group_m']) and \
           (row_m['position_m'] >= row_in['position_start']) and \
           (row_m['position_m'] < row_in['position_end']):
            # iterrows() yields copies, so write the result back through the frame
            test_in.at[index_in, 'data_values'] -= row_m['score_m']

How can I write the same thing as the code above, but avoid the double loop, which leaves me with O(N×M) complexity where both N and M are large (the mock file has many more entries than the in file)?

2 answers:

Answer 0 (score: 2)

What you want is a typical join problem. In pandas we use the merge method for that. You can rewrite your iterrows loop as the code below, which will be much faster because it uses vectorized operations:

# first merge your two dataframes on the key columns 'group' and 'group_m'
common = pd.merge(test_in,
                  test_mock,
                  left_on='group',
                  right_on='group_m')

# after that filter the rows you need; copy() avoids a SettingWithCopyWarning on the assignment below
df_filter = common[(common.position_m >= common.position_start) & (common.position_m < common.position_end)].copy()

# apply the calculation that is needed on column 'data_values'
df_filter['data_values'] = df_filter['data_values'] - df_filter['score_m']

# drop the columns we don't need
df_filter = df_filter[['group', 'position_start', 'position_end', 'data_values']].reset_index(drop=True)

# now we need to get the rows from the original dataframe 'test_in' which did not get filtered
# (note: this relies on position_start/position_end values not repeating across rows)
unmatch = test_in[(test_in.group.isin(df_filter.group)) & (~test_in.position_start.isin(df_filter.position_start)) & (~test_in.position_end.isin(df_filter.position_end))]

# finally we can concat these two together
df_final = pd.concat([df_filter, unmatch], ignore_index=True)

Output

    group   position_start  position_end    data_values
0   1       10              15              10
1   1       20              25              10
2   2       40              45              11
3   1       30              35              13
4   2       50              55              15
5   2       60              65              16
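
One caveat worth adding (my note, not part of the original answer): in the sample data at most one mock row falls inside each interval. If several mock rows can match the same input row, the merged frame carries one row per match, so the scores should be aggregated first and subtracted once per row. A minimal sketch building on the common frame from above:

# keep the matching rows, as above
matches = common[(common.position_m >= common.position_start) &
                 (common.position_m < common.position_end)]

# sum all matching scores per input row
score_sum = (matches
             .groupby(['group', 'position_start', 'position_end'], as_index=False)['score_m']
             .sum()
             .rename(columns={'score_m': 'total_score'}))

# a left join keeps the unmatched input rows; their missing score becomes 0
out = test_in.merge(score_sum, how='left',
                    on=['group', 'position_start', 'position_end'])
out['data_values'] = out['data_values'] - out['total_score'].fillna(0)
out = out.drop(columns='total_score')  # data_values is float now; cast back if needed

This also removes the need for the separate unmatch/concat step, since the left join keeps every row of test_in in its original order.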

Answer 1 (score: 0)

The accepted answer is in place and should work, but since the OP's data volume is huge he could not get that solution to perform. So I wanted to try an experimental approach, which is why I am adding this as a second answer instead of editing my already-accepted one:

Extra step for the solution: as we can see, the cardinality becomes many-to-many, because there are duplicates in both key columns, group & group_m.

So I looked at the data and saw that every position_start value is rounded to base 10. We can therefore reduce the cardinality by creating an artificial key column called position_m_round in the second df, 'test_mock'. This works here because every interval spans [position_start, position_start + 5), so any position that falls inside an interval rounds to that interval's start:

# make a function which rounds integers to the nearest multiple of 10
def myround(x, base=10):
    return int(base * round(float(x)/base))

# apply this function to the 'position_m' column to create a new key column to join on
test_mock['position_m_round'] = test_mock.position_m.apply(myround)

    group_m position_m  score_m position_m_round
0   1       11          1       10
1   1       16          1       20
2   1       20          2       20
3   1       52          1       50
4   2       42          3       40
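
Since speed is the point of the question, the apply call above could also be vectorized. A small variant of my own (not from the original answer), using the same round-half-to-even rule as Python 3's built-in round():

# vectorized equivalent of the apply() above: round the whole column at once
# (Series.round, like round(), rounds halves to the nearest even integer)
test_mock['position_m_round'] = ((test_mock['position_m'] / 10).round().astype(int) * 10)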

# do the merge again, but now we reduce cardinality because we have two keys to join on
common = pd.merge(test_in,
                  test_mock,
                  left_on=['group', 'position_start'],
                  right_on=['group_m', 'position_m_round'])

'''
this part becomes the same as the original answer
'''

# after that filter the rows you need; copy() avoids a SettingWithCopyWarning on the assignment below
df_filter = common[(common.position_m >= common.position_start) & (common.position_m < common.position_end)].copy()

# apply the calculation that is needed on column 'data_values'
df_filter['data_values'] = df_filter['data_values'] - df_filter['score_m']

# drop the columns we don't need
df_filter = df_filter[['group', 'position_start', 'position_end', 'data_values']].reset_index(drop=True)

# now we need to get the rows from the original dataframe 'test_in' which did not get filtered
# (note: this relies on position_start/position_end values not repeating across rows)
unmatch = test_in[(test_in.group.isin(df_filter.group)) & (~test_in.position_start.isin(df_filter.position_start)) & (~test_in.position_end.isin(df_filter.position_end))]

# finally we can concat these two together
df_final = pd.concat([df_filter, unmatch], ignore_index=True)

Output

    group   position_start  position_end    data_values
0   1       10              15              10
1   1       20              25              10
2   2       40              45              11
3   1       30              35              13
4   2       50              55              15
5   2       60              65              16
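
If the rounding trick does not fit the real data, another direction (a sketch of my own, not from either answer) is a per-group interval lookup with numpy's searchsorted. It assumes the intervals within a group do not overlap, which holds for the sample data:

import numpy as np

# per-group interval lookup with np.searchsorted,
# assuming intervals within a group do not overlap
def normalize(test_in, test_mock):
    out = test_in.copy()
    corrections = np.zeros(len(out), dtype=np.int64)
    for g, mock_g in test_mock.groupby('group_m'):
        in_g = out[out['group'] == g].sort_values('position_start')
        starts = in_g['position_start'].to_numpy()
        ends = in_g['position_end'].to_numpy()
        pos = mock_g['position_m'].to_numpy()
        scores = mock_g['score_m'].to_numpy()
        # index of the last interval whose start is <= position
        idx = np.searchsorted(starts, pos, side='right') - 1
        # keep only the positions that also fall before that interval's end
        hit = (idx >= 0) & (pos < ends[np.maximum(idx, 0)])
        # total score that landed in each interval
        summed = np.bincount(idx[hit], weights=scores[hit], minlength=len(starts))
        corrections[out.index.get_indexer(in_g.index)] += summed.astype(np.int64)
    out['data_values'] = out['data_values'] - corrections
    return out

print(normalize(test_in, test_mock))

This keeps the work at roughly O((N + M) log N) per group instead of O(N×M), at the cost of the non-overlap assumption.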