I have a fairly standard pandas DataFrame that looks like this:
df[:5]
AREA REG PRO CAP
0 1 1 1 0
1 1 1 1 0
2 1 1 1 0
3 1 1 1 0
4 1 1 1 0
My goal is to create a new column "GROUP" that takes one of two values, 1 (for the first group) and 2 (for the second group), like this:
AREA REG PRO CAP GROUP
0 1 1 1 0 1
1 1 1 1 0 2
2 1 1 1 0 1
3 1 1 1 0 2
4 1 1 1 0 1
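To fill the column I need to run a balance check under several conditions. Since finding a perfect match across all of these conditions is more or less impossible, the script allows an error of ±1 in each check.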
GROUP BY AREA 50% 1 and 50% 2
and
GROUP BY REG 50% 1 and 50% 2
and
GROUP By PRO 50% 1 and 50% 2
and
GROUP By CAP 50% 1 and 50% 2
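A minimal sketch of how one of these checks could be expressed, assuming the column names shown above and that 0 marks a row that is not assigned yet. The helper name `unbalanced_groups` and the `tol` parameter are hypothetical, introduced only for illustration:

```python
import pandas as pd

def unbalanced_groups(df, keys, tol=1):
    """Return the cells (by `keys`) whose counts of 1 and 2 differ by more than `tol`."""
    # Boolean flags sum to per-cell counts; rows with GROUP == 0 count for neither.
    flags = pd.DataFrame({"n1": df["GROUP"].eq(1), "n2": df["GROUP"].eq(2)})
    counts = flags.groupby([df[k] for k in keys]).sum()
    diff = (counts["n1"] - counts["n2"]).abs()
    return counts[diff > tol]

# Toy data (not the real table): AREA 1 is balanced, AREA 2 is not.
toy = pd.DataFrame({
    "AREA":  [1, 1, 1, 1, 2, 2, 2],
    "GROUP": [1, 2, 1, 2, 1, 1, 1],
})
print(unbalanced_groups(toy, ["AREA"]))
```

An empty result would mean that the grouping passes the check within the tolerance; the same call with `["REG"]`, `["PRO"]` or `["CAP"]` would cover the other conditions.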
ipdb> df.groupby(['AREA'])['GROUP'].value_counts()
AREA GROUP
1 0 367
2 2
1 1
2 0 287
1 2
3 0 271
4 0 305
1 2
5 0 150
1 1
2 1
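For example, 1 (tot: 23) and 2 (tot: 24) would be acceptable, but all of the conditions must be satisfied before the assignment counts as feasible.
In addition, there is a further complication:
Within each AREA, REG and PRO group, I also need an equal distribution for each CAP value,
like GROUP BY ['CAP','AREA'] 50% 1 and 50% 2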
ipdb> df.groupby(['CAP','AREA'])['GROUP'].value_counts()
CAP AREA GROUP
0 1 0 261
2 2
1 1
2 0 203
1 2
3 0 155
4 0 235
1 1
5 0 103
1 1
1 1 0 106
2 0 84
3 0 116
4 0 70
1 1
5 0 48
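For a single cell of the output above, such as CAP = 0, AREA = 2 (203 unassigned rows and 2 rows already in group 1), the number of rows that still has to go to each group can be worked out with simple arithmetic. The sketch below is only an illustration; the function name `split_unassigned` is hypothetical, and giving the larger half to group 2 when the total is odd is an arbitrary choice:

```python
def split_unassigned(n_unassigned, n1, n2):
    """Split unassigned rows so the final counts of 1 and 2 differ by at most 1."""
    total = n_unassigned + n1 + n2
    target1 = total // 2        # smaller half goes to group 1 (arbitrary choice)
    target2 = total - target1   # larger half goes to group 2 when total is odd
    add1, add2 = target1 - n1, target2 - n2
    if add1 < 0 or add2 < 0:
        raise ValueError("existing assignment already exceeds the balanced target")
    return add1, add2

# CAP = 0, AREA = 2: 203 unassigned, 2 already in group 1, none in group 2.
print(split_unassigned(203, 2, 0))  # (100, 103)
```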
In the example here, the 203 unassigned (0) values should become 1 (100) and 2 (103) to match the request: 1 (102) and 2 (103).
What I have done so far:
My dataset is not really huge (1500 rows), and my first attempt relies on a simple backtracking algorithm (sudoku-style):
import sys

import pandas as pd


def find_empty_location(df, ran, col, l):
    # Walk the shuffled index and store the first unassigned row (GROUP == 0) in l[0].
    for index in ran.index:
        if df.at[index, col] == 0:
            l[0] = index
            return True
    return False


def check_pro(df, cap, pro):
    # Returns True when PROV `pro` is unbalanced (counts of 1 and 2 differ by
    # more than 1), either within the given CAP or overall.
    value = df.groupby(['CAP', 'PROV'])['GROUP'].value_counts()[cap].get(pro, 0)
    if not value.get(1, 0) - 1 <= value.get(2, 0) <= value.get(1, 0) + 1:
        return True
    valueG = df.groupby(['PROV'])['GROUP'].value_counts().get(pro, 0)
    if not valueG.get(1, 0) - 1 <= valueG.get(2, 0) <= valueG.get(1, 0) + 1:
        return True
    return False


def check_reg(df, cap, reg):
    # Same check as check_pro, for REG.
    value = df.groupby(['CAP', 'REG'])['GROUP'].value_counts()[cap].get(reg, 0)
    if not value.get(1, 0) - 1 <= value.get(2, 0) <= value.get(1, 0) + 1:
        return True
    valueG = df.groupby(['REG'])['GROUP'].value_counts().get(reg, 0)
    if not valueG.get(1, 0) - 1 <= valueG.get(2, 0) <= valueG.get(1, 0) + 1:
        return True
    return False


def check_area(df, cap, area):
    # Same check as check_pro, for AREA.
    value = df.groupby(['CAP', 'AREA'])['GROUP'].value_counts()[cap].get(area, 0)
    if not value.get(1, 0) - 1 <= value.get(2, 0) <= value.get(1, 0) + 1:
        return True
    valueG = df.groupby(['AREA'])['GROUP'].value_counts().get(area, 0)
    if not valueG.get(1, 0) - 1 <= valueG.get(2, 0) <= valueG.get(1, 0) + 1:
        return True
    return False


def check_location_is_safe(df, cap, pro, reg, area):
    # Safe only if none of the three checks reports an imbalance.
    return (not check_pro(df, cap, pro)
            and not check_reg(df, cap, reg)
            and not check_area(df, cap, area))


def solve_sudoku(df, ran, counter):
    l = [0, 0]
    col = "GROUP"
    if not find_empty_location(df, ran, col, l):
        return True
    row = l[0]
    print(f"processing ROW: {row} counter: {counter}")
    # consider digits 1 and 2
    for num in range(1, 3):
        # if looks promising (the check is evaluated before num is written)
        if check_location_is_safe(df, df.at[row, 'CAP'], df.at[row, 'PROV'],
                                  df.at[row, 'REG'], df.at[row, 'AREA']):
            # make tentative assignment (.at avoids chained assignment)
            df.at[row, col] = num
            counter = counter + 1
            # return, if success, ya!
            if solve_sudoku(df, ran, counter):
                return True
            # failure, unmake & try again
            df.at[row, col] = 0
            counter = counter - 1
    # this triggers backtracking
    return False


# Driver main function to test above functions
if __name__ == "__main__":
    # One recursive call per assigned row, so the default recursion limit (1000)
    # is too low for this dataset.
    sys.setrecursionlimit(10000)
    df = pd.read_csv("sample_full.csv", sep=";")
    df['GROUP'] = 0
    counter = 0
    df.fillna(0, inplace=True)
    # I use a shuffled DataFrame to randomly access the next free cell
    ran = df.sample(frac=1)
    if solve_sudoku(df, ran, counter):
        print("solution exists")
        df.to_csv("solution_SD3.csv")
    else:
        print("No solution exists")
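I know for sure that a solution exists (I have a manually balanced dataset as a reference).
I don't really need a fast computation; it could even take a day if necessary.
I am also trying a genetic algorithm approach (which I studied a long time ago), but since only a limited number of deterministic solutions can be found, I am not sure it is the right way to tackle this problem.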
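If the genetic-algorithm route were pursued, one possible fitness function is the total excess imbalance over all the groupings described above, so that an assignment is feasible exactly when the penalty is zero. This is only a sketch under assumptions: the `GROUPINGS` list, the `penalty` name and the `tol` parameter are hypothetical, and the column names follow the code above (`PROV` rather than `PRO`):

```python
import pandas as pd

# Hypothetical list of groupings, mirroring the conditions described above.
GROUPINGS = [["AREA"], ["REG"], ["PROV"], ["CAP"],
             ["CAP", "AREA"], ["CAP", "REG"], ["CAP", "PROV"]]

def penalty(df, group, groupings=GROUPINGS, tol=1):
    """Sum, over every grouping and cell, of how far |n1 - n2| exceeds `tol`."""
    # `group` is a candidate GROUP column (Series of 1s and 2s aligned with df).
    flags = pd.DataFrame({"n1": group.eq(1), "n2": group.eq(2)})
    total = 0
    for keys in groupings:
        counts = flags.groupby([df[k] for k in keys]).sum()
        excess = (counts["n1"] - counts["n2"]).abs() - tol
        total += int(excess.clip(lower=0).sum())
    return total

# Toy data (not the real table): an alternating assignment has zero penalty.
toy = pd.DataFrame({"AREA": [1, 1, 1, 1], "REG": [1, 1, 2, 2],
                    "PROV": [1, 2, 3, 4], "CAP": [0, 0, 1, 1]})
print(penalty(toy, pd.Series([1, 2, 1, 2])))  # 0
```

A lower penalty means a better candidate, so a search procedure would try to minimise this value.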
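I think my problem lies in the conditions I coded... or maybe I am overthinking this problem.
Of course, solving it this way is not mandatory...
Thanks for your time.
P.S. Unfortunately I cannot share the original data table, but I can describe it: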
ipdb> df.describe()
AREA REG PROV CAP GROUP
count 1389.000000 1389.000000 1389.000000 1389.000000 1389.000000
mean 2.699064 9.696904 48.421886 0.305976 0.008639
std 1.357604 5.903486 29.398053 0.460985 0.113550
min 1.000000 1.000000 1.000000 0.000000 0.000000
25% 1.000000 4.000000 22.000000 0.000000 0.000000
50% 3.000000 9.000000 50.000000 0.000000 0.000000
75% 4.000000 15.000000 72.000000 1.000000 0.000000
max 5.000000 20.000000 111.000000 1.000000 2.000000