Python-Pandas: splitting a dataset into two balanced groups under complex conditions

Time: 2019-11-27 23:35:57

Tags: python pandas pandas-groupby backtracking balanced-groups

I have a fairly standard pandas DataFrame that looks like this:

df[:5]
   AREA  REG  PRO  CAP
0     1    1     1    0
1     1    1     1    0
2     1    1     1    0
3     1    1     1    0
4     1    1     1    0

My goal is to create a new column "GROUP" that holds one of two values, 1 (for the first group) and 2 (for the second group), like this:

   AREA  REG  PRO  CAP  GROUP
0     1    1     1    0      1
1     1    1     1    0      2
2     1    1     1    0      1
3     1    1     1    0      2
4     1    1     1    0      1

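To fill this column I need to check the balance under several conditions at once. Since finding a perfect match under all of these conditions is more or less impossible, the script allows an error of +-1 in each check.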

GROUP BY AREA 50% 1 and 50% 2
and
GROUP BY REG 50% 1 and 50% 2
and
GROUP By PRO 50% 1 and 50% 2
and 
GROUP By CAP 50% 1 and 50% 2
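For reference, the "50% / 50%, give or take 1" rule can be tested for any column or combination of columns with a single helper. This is only a minimal sketch: the function name `unbalanced_cells`, the `tol` parameter and the toy data are made up for illustration, and rows with GROUP 0 are assumed to mean "not assigned yet".

```python
import pandas as pd

def unbalanced_cells(df, keys, group_col="GROUP", tol=1):
    """Return |n1 - n2| for every cell of `keys` where it exceeds `tol`."""
    assigned = df[df[group_col].isin([1, 2])]      # assumption: 0 = unassigned
    counts = (assigned.groupby(list(keys) + [group_col]).size()
                      .unstack(group_col, fill_value=0)
                      .reindex(columns=[1, 2], fill_value=0))
    diff = (counts[1] - counts[2]).abs()
    return diff[diff > tol]

# Toy data (hypothetical): AREA 1 is balanced, AREA 2 is not.
toy = pd.DataFrame({
    "AREA":  [1, 1, 1, 1, 2, 2, 2],
    "CAP":   [0, 0, 1, 1, 0, 0, 0],
    "GROUP": [1, 2, 1, 2, 1, 1, 1],
})
bad_area = unbalanced_cells(toy, ["AREA"])             # single-column check
bad_cap_area = unbalanced_cells(toy, ["CAP", "AREA"])  # combined check
```

An empty result for every key list would mean that all conditions hold at the same time.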

ipdb> df.groupby(['AREA'])['GROUP'].value_counts()
AREA  GROUP
1     0        367
      2          2
      1          1
2     0        287
      1          2
3     0        271
4     0        305
      1          2
5     0        150
      1          1
      2          1

For example, 1 (tot: 23) and 2 (tot: 24) is acceptable, but all of the conditions must hold for an assignment to count as feasible.

On top of that, there is a further complication:

inside the AREA, REG, PRO, CAP group, I need equal distribution on CAP
like GROUP BY ['CAP','AREA'] 50% 1 and 50 % 2

ipdb> df.groupby(['CAP','AREA'])['GROUP'].value_counts()
CAP  AREA  GROUP
0    1     0        261
           2          2
           1          1
     2     0        203
           1          2
     3     0        155
     4     0        235
           1          1
     5     0        103
           1          1
1    1     0        106
     2     0         84
     3     0        116
     4     0         70
           1          1
     5     0         48
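The arithmetic for a single cell of this table can be written down directly. A minimal sketch (the helper name `remaining_quota` is made up for illustration), fed with the CAP 0, AREA 2 row of the output above: 203 rows unassigned, 2 already in group 1, none in group 2.

```python
def remaining_quota(n1, n2, n_unassigned):
    """How many unassigned rows each group still needs so that the
    final sizes of group 1 and group 2 differ by at most 1."""
    total = n1 + n2 + n_unassigned
    small, large = total // 2, total - total // 2
    # Try giving the smaller half to group 1 first, then the other way round.
    for target1, target2 in ((small, large), (large, small)):
        need1, need2 = target1 - n1, target2 - n2
        if need1 >= 0 and need2 >= 0:
            return need1, need2
    raise ValueError("the rows already assigned cannot be balanced any more")

need1, need2 = remaining_quota(2, 0, 203)   # (100, 103)
```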

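In this example, the 203 unassigned (0) values for CAP 0, AREA 2 should become 1 (100) and 2 (103), so that the totals match the request: 1 (102) and 2 (103).

What I have done so far: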

My dataset is not really huge (1500 rows), and my first attempt at the puzzle is a simple backtracking algorithm (Sudoku-style):

import pandas as pd


def find_empty_location(df, ran, col, l): 
    for index,val in ran[col].items(): 
        if df[col][index] == 0:
            l[0]=index
            return True
    return False

def check_pro(df, cap, pro):
    value  = df.groupby(['CAP','PROV'])['GROUP'].value_counts()[cap].get(pro,0)
    if not value.get(1,0) - 1 <= value.get(2,0) <= value.get(1,0) + 1:
        return True
    valueG  = df.groupby(['PROV'])['GROUP'].value_counts().get(pro,0)
    if not valueG.get(1,0) - 1 <= valueG.get(2,0) <= valueG.get(1,0) + 1:
        return True
    return False

def check_reg(df, cap, reg):
    value  = df.groupby(['CAP','REG'])['GROUP'].value_counts()[cap].get(reg,0)
    if not value.get(1,0) - 1 <= value.get(2,0) <= value.get(1,0) + 1:
        return True
    valueG  = df.groupby(['REG'])['GROUP'].value_counts().get(reg,0)
    if not valueG.get(1,0) - 1 <= valueG.get(2,0) <= valueG.get(1,0) + 1:
        return True
    return False

def check_area(df, cap, area):
    value  = df.groupby(['CAP','AREA'])['GROUP'].value_counts()[cap].get(area,0)
    if not value.get(1,0) - 1 <= value.get(2,0) <= value.get(1,0) + 1:
        return True
    valueG  = df.groupby(['AREA'])['GROUP'].value_counts().get(area,0)
    if not valueG.get(1,0) - 1 <= valueG.get(2,0) <= valueG.get(1,0) + 1:
        return True
    return False


def check_location_is_safe(df, cap, pro, reg, area): 

    return not check_pro(df, cap, pro) and not check_reg(df, cap, reg) and not check_area(df, cap, area)


def solve_sudoku(df, ran, counter): 

    l=[0,0] 
    col = "GROUP"
    if(not find_empty_location(df,ran,col,l)): 
        return True

    row=l[0] 
    print ("processing ROW: " + str(row) + " counter: " +str(counter))

    # consider digits 1 and 2 
    for num in range(1,3): 

        # if looks promising 
        if(check_location_is_safe(df, df['CAP'][row], df['PROV'][row], df['REG'][row], df['AREA'][row])): 

            # make tentative assignment 
            df.loc[row, col] = num
            counter = counter + 1
            # return, if success, ya! 
            if(solve_sudoku(df, ran, counter)): 
                return True
            # failure, unmake & try again 
            df.loc[row, col] = 0

    # this triggers backtracking         
    counter = counter - 1
    return False 

# Driver main function to test above functions 
if __name__=="__main__": 

    df = pd.read_csv("sample_full.csv", sep = ";")
    df['GROUP'] =   0
    counter = 0

    df.fillna(0, inplace = True)
    # i use a shuffled DataFrame to random access the next free cell
    ran = df.sample(frac=1)


    if(solve_sudoku(df, ran, counter)): 
        print ("solution exist")
        df.to_csv("solution_SD3.csv")
    else: 
        print ("No solution exists")

I know for certain that a solution exists (I have a manually balanced dataset as a reference).

I don't really need the computation to be fast; it could even take a day if necessary.

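I am also trying a genetic-algorithm approach (which I studied a long time ago), but since only a limited number of deterministic solutions can be found, I am not sure it is the right way to tackle this problem.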

I think my problem lies in the conditions I coded... or I am overthinking this.
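Two details of the script above may be worth a second look. `check_location_is_safe` is called before the tentative assignment and never receives `num`, so both candidates get the same verdict. And `solve_sudoku` recurses once per row, which for about 1400 rows would likely exceed Python's default recursion limit of 1000. The sketch below is one possible way around both, not a drop-in fix: it works on plain dicts (for example from `df.to_dict("records")`), keeps running counters instead of regrouping, tests a candidate after placing it, and backtracks with a loop. The names `balanced_split`, `constraints` and the toy rows are assumptions made for illustration.

```python
from collections import Counter

def balanced_split(rows, constraints, tol=1):
    """Assign group 1 or 2 to every row so that, in every cell of every
    constraint, the two group sizes differ by at most `tol`.
    Returns a list of groups, or None if no assignment exists."""
    cells = [[(keys, tuple(row[k] for k in keys)) for keys in constraints]
             for row in rows]
    unassigned = Counter(c for row_cells in cells for c in row_cells)
    diff = Counter()                  # (#group 1) - (#group 2) per cell
    groups = [0] * len(rows)

    def update(i, g, direction):      # direction +1 = place, -1 = undo
        step = 1 if g == 1 else -1
        for c in cells[i]:
            diff[c] += direction * step
            unassigned[c] -= direction

    def feasible(i):
        # A cell can still be repaired only if its remaining rows
        # are enough to pull the imbalance back within tol.
        return all(abs(diff[c]) - unassigned[c] <= tol for c in cells[i])

    i = 0
    while 0 <= i < len(rows):
        prev = groups[i]
        if prev:                      # we came back here: undo old choice
            update(i, prev, -1)
        groups[i] = 0
        for g in range(prev + 1, 3):  # try only values not tried yet
            update(i, g, +1)
            if feasible(i):
                groups[i] = g
                break
            update(i, g, -1)
        i += 1 if groups[i] else -1   # advance, or backtrack
    return groups if i == len(rows) else None

# Toy data (hypothetical), 12 rows.
rows = [{"AREA": i % 3 + 1, "REG": i % 4 + 1, "CAP": i % 2} for i in range(12)]
constraints = [("AREA",), ("REG",), ("CAP", "AREA"), ("CAP", "REG")]
solution = balanced_split(rows, constraints)
```

The search is still exponential in the worst case, so the order in which rows and values are tried can matter a great deal on real data.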

Of course, solving it this way is not mandatory...
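Since other approaches are welcome, a heuristic alternative to both backtracking and a genetic algorithm is a randomized local search (hill climbing that also accepts sideways moves). The sketch below is an assumption-laden illustration, not a proven method for this dataset: it starts from a random split, flips one row at a time, and keeps a flip only if the total violation does not grow. The names `penalty`, `local_search`, `max_steps` and `seed` are made up, and there is no guarantee that it reaches zero violations.

```python
import random
from collections import Counter

def penalty(diff, tol=1):
    """Total amount by which the cells exceed the allowed imbalance."""
    return sum(max(0, abs(d) - tol) for d in diff.values())

def local_search(rows, constraints, tol=1, max_steps=10_000, seed=0):
    rng = random.Random(seed)
    cells = [[(keys, tuple(row[k] for k in keys)) for keys in constraints]
             for row in rows]
    groups = [rng.choice((1, 2)) for _ in rows]     # random starting split
    diff = Counter()                                # (#group 1) - (#group 2)
    for row_cells, g in zip(cells, groups):
        for c in row_cells:
            diff[c] += 1 if g == 1 else -1
    current = penalty(diff, tol)
    for _ in range(max_steps):
        if current == 0:
            break
        i = rng.randrange(len(rows))
        step = -2 if groups[i] == 1 else 2          # effect of flipping row i
        delta = sum(max(0, abs(diff[c] + step) - tol)
                    - max(0, abs(diff[c]) - tol) for c in cells[i])
        if delta <= 0:                              # never accept a worse state
            for c in cells[i]:
                diff[c] += step
            groups[i] = 3 - groups[i]
            current += delta
    return groups, current

# Toy data (hypothetical), same shape as the real columns.
rows = [{"AREA": i % 3 + 1, "REG": i % 4 + 1, "CAP": i % 2} for i in range(12)]
constraints = [("AREA",), ("REG",), ("CAP", "AREA"), ("CAP", "REG")]
groups, left_over = local_search(rows, constraints)
```

A `left_over` of 0 means every condition is met; anything above 0 means the search got stuck and could be restarted with another `seed`.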

Thanks for your time.

P.S. Unfortunately I cannot share the original data table, but I can describe it:

ipdb> df.describe()
              AREA          REG         PROV          CAP        GROUP
count  1389.000000  1389.000000  1389.000000  1389.000000  1389.000000
mean      2.699064     9.696904    48.421886     0.305976     0.008639
std       1.357604     5.903486    29.398053     0.460985     0.113550
min       1.000000     1.000000     1.000000     0.000000     0.000000
25%       1.000000     4.000000    22.000000     0.000000     0.000000
50%       3.000000     9.000000    50.000000     0.000000     0.000000
75%       4.000000    15.000000    72.000000     1.000000     0.000000
max       5.000000    20.000000   111.000000     1.000000     2.000000

0 answers:

No answers