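Please see the sample DataFrame and my code below. Here is my step-by-step goal:
Step 1: Combine Column_A and Column_B into Column_A_B.
Step 2: Count every instance of each value in 'Column_A_B'.
Step 3: Filter out the rows whose 'Column_A_B' value occurs only once.
Step 4: Delete each row that has 'Cancelled' in the 'Status' column, and only those rows. Some rows may share the same Column_A_B value but have different 'Status' values (note that the Step 3 filter is still applied here).
My code seems to work up to Step 5, but I am really stuck on Step 5.
Step 5: With the 'Column_A_B' filter still on (i.e. rows with a count of 1 filtered out), look at the redundant values (where the count of a 'Column_A_B' value is 2 or more), and for each such group look at the 'Qty' column. If the group has both a Qty below 16 and a Qty above 99, delete only the rows with the Qty below 16. If all Qty values in the group are below 99, don't delete anything; if all Qty values are above 99, don't delete anything either.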
import random

import pandas as pd

df = pd.DataFrame({
    'Column_A': ['test1', 'test7', 'test7', 'test4', 'test6', 'test6', 'test7'],
    'Column_B': ['WO1', 'WO7', 'WO7', 'WO6', 'WO6', 'WO6', 'WO7'],
    'Column_A_B': ['', '', '', '', '', '', ''],
    'Status': ['Cancelled', 'Cancelled', 'Active', 'Active', 'Open', 'Active', 'Active'],
    'Qty': ['12', '34', '13', '3000', '14', '88', '1500'],
})

# Empty frame with the same columns, used to collect the deleted rows
df_deleted = df.copy(deep=True)
df_deleted.drop(df.index, inplace=True)

LOWER_THRESHOLD = 16
print("1. combine col A & B ")
for i, row in df.iterrows():  # iterate over each row: index and row content
    a = str(row['Column_A'])
    b = str(row['Column_B'])
    concat = a + b
    df.loc[i, 'Column_A_B'] = concat
# worked 2.21
print('2. Count all the duplicates of the combined values above')
seen = {}
for i, row in df.iterrows():  # count the combined values; note dict keys can't be duplicated
    c = row['Column_A_B']
    if c not in seen:  # have not seen this value before, so initialise its counter
        seen[c] = 0
    seen[c] += 1  # seen the concatenated value once more, add one
for i, row in df.iterrows():  # write the recorded counts back, using each row's value as the dict key
    c = row['Column_A_B']
    times_seen = seen[c]
    df.loc[i, 'Count_Of_Value'] = times_seen
# worked 2.21
print("3. Flag rows by count of the concatenated value: True if the count is 1, else False")
for i, row in df.iterrows():
    d = row['Count_Of_Value']
    if d == 1.0:
        df.loc[i, 'True_False'] = True
    else:
        df.loc[i, 'True_False'] = False
# worked 2.21
print('4. Delete all rows where orders are cancelled but the count of the concatenated column is more than 1')
delete_these = []
for i, row in df.iterrows():
    f = row['Status']
    d = row['True_False']
    if str(f) == 'Cancelled' and d != True:
        delete_these.append(i)  # store row label to drop later
        df_deleted = pd.concat([df_deleted, row.to_frame().T])  # add the row to the other dataframe
df.drop(delete_these, axis=0, inplace=True)
# worked 2.21 on this small df
print('step 5. Delete qty where Column_A_B is the same, has more than 1 instance, and if said grouping has a Qty above 99 and below 16, delete the value below 16, if the grouping of values all have qtys less than 100 or over 100 dont delete anything')
over_numbers = {}
for i, row in df.iterrows():
    c = row['Column_A_B']  # 2.21 this appears to be where the error is, trying to replace combined column w/ wo
    g = row['Qty']
    d = c + str(random.randint(1, 10000000))  # attempting to create unique value
    df.loc[i, 'test'] = d  # make column to match unique value for each qty
    if float(g) > float(99):
        over_numbers[d] = True
print(over_numbers)
## the issue is that it is storing values that are duplicated, so the below doesn't know which one to assign T/F to 2.21
for i, row in df.iterrows():  # storing the numbers over 99
    c = row['test']  # loop through unique value
    if c in over_numbers:
        df.loc[i, 'Comments_Status'] = True
    else:
        df.loc[i, 'Comments_Status'] = False
## the above appeared to label True/False correctly after adding unique values to combined column 2.21
delete_these = []
for i, row in df.iterrows():  # remove all rows that have over_number = True and also a number less than 16
    d = row['Qty']  # should this be changed?
    f = row['Comments_Status']
    z = row['test']
    if int(d) <= LOWER_THRESHOLD and f == True:  # so grouping 1st arts
        delete_these.append(i)  # store row label to drop later
        df_deleted = pd.concat([df_deleted, row.to_frame().T])  # add the row to the other dataframe
df.drop(delete_these, axis=0, inplace=True)
# end
with pd.ExcelWriter('keep.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
with pd.ExcelWriter('deleted.xlsx', engine='xlsxwriter') as writer:
    df_deleted.to_excel(writer, sheet_name='Sheet1')
This is what I want the DataFrame above to look like when the program finishes (the one I named keep.xlsx above):
import pandas as pd

goaldf = pd.DataFrame({
    'Column_A': ['test1', 'test4', 'test6', 'test6', 'test7'],
    'Column_B': ['WO1', 'WO6', 'WO6', 'WO6', 'WO7'],
    'Column_A_B': ['test1WO1', 'test4WO6', 'test6WO6', 'test6WO6', 'test7WO7'],
    'Status': ['Cancelled', 'Active', 'Open', 'Active', 'Active'],
    'Qty': ['12', '3000', '14', '88', '1500'],
})
with pd.ExcelWriter('goaldf.xlsx', engine='xlsxwriter') as writer:
    goaldf.to_excel(writer, sheet_name='Sheet1')
Answer 0 (score: 2)
Following your explanation:
"""
expected result
  Column_A Column_B Column_A_B     Status   Qty
0    test1      WO1   test1WO1  Cancelled    12
1    test4      WO6   test4WO6     Active  3000
2    test6      WO6   test6WO6       Open    14
3    test6      WO6   test6WO6     Active    88
4    test7      WO7   test7WO7     Active  1500
"""
import pandas as pd

df = pd.DataFrame({
    'Column_A': ['test1', 'test7', 'test7', 'test4', 'test6', 'test6', 'test7'],
    'Column_B': ['WO1', 'WO7', 'WO7', 'WO6', 'WO6', 'WO6', 'WO7'],
    'Status': ['Cancelled', 'Cancelled', 'Active', 'Active', 'Open', 'Active', 'Active'],
    'Qty': ['12', '34', '13', '3000', '14', '88', '1500'],
})
# Empty frame with the same columns, used to collect the deleted rows
df_deleted = df.copy(deep=True)
df_deleted.drop(df.index, inplace=True)
#Step1
def process(r):
    # concatenate the two values of the row
    return r['Column_A'] + r['Column_B']

df["Column_A_B"] = df.apply(lambda row: process(row), axis=1)
print("step 1"); print(df)

#Step2
# number of rows sharing the same Column_A_B value, repeated on every row
df['countAB'] = df.groupby('Column_A_B')['Qty'].transform('count')
print("step 2"); print(df)

#Step3
df['True_False'] = df['countAB'] == 1
print("step 3"); print(df)

#Step4
# cancelled rows whose Column_A_B value occurs more than once are removed
todelete = df[(df['Status'] == 'Cancelled') & (df['True_False'] == False)]
df = df[(df['Status'] != 'Cancelled') | (df['True_False'] == True)]
df = df.drop(['countAB', 'True_False'], axis=1)
todelete = todelete.drop(['True_False', 'countAB'], axis=1)
df_deleted = pd.concat([df_deleted, todelete])
print("step 4"); print(df); print("step 4 - deleted"); print(df_deleted)
#Step5
df['Qty'] = df['Qty'].astype(int)
# maximum Qty of each Column_A_B group, repeated on every row of the group
df['maxAB'] = df.groupby('Column_A_B')['Qty'].transform('max')
todelete = df[(df['maxAB'] > 99) & (df['Qty'] <= 16)]
df = df[(df['maxAB'] <= 99) | (df['Qty'] > 16)]
df = df.reset_index(drop=True)
todelete = todelete.drop(['maxAB'], axis=1)
df_deleted = pd.concat([df_deleted, todelete])
df = df.drop(['maxAB'], axis=1)
print("step 5"); print(df); print("step 5 - deleted"); print(df_deleted)
Output:
step 5
  Column_A Column_B     Status   Qty Column_A_B
0    test1      WO1  Cancelled    12   test1WO1
1    test4      WO6     Active  3000   test4WO6
2    test6      WO6       Open    14   test6WO6
3    test6      WO6     Active    88   test6WO6
4    test7      WO7     Active  1500   test7WO7
step 5 - deleted
  Column_A Column_A_B Column_B Qty     Status
1    test7   test7WO7      WO7  34  Cancelled
2    test7   test7WO7      WO7  13     Active
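Steps 2 to 4 can also be written without the temporary countAB and True_False columns. The following is only a minimal sketch on a reduced sample; the names group_size, cancelled_dupe, kept and deleted are illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'Column_A_B': ['test1WO1', 'test7WO7', 'test7WO7', 'test4WO6'],
    'Status': ['Cancelled', 'Cancelled', 'Active', 'Active'],
})

# Step 2: size of the group each row belongs to, one value per row
group_size = df.groupby('Column_A_B')['Column_A_B'].transform('size')

# Steps 3 and 4: a row is removed only if it is cancelled AND its key is duplicated
cancelled_dupe = (df['Status'] == 'Cancelled') & (group_size > 1)

deleted = df[cancelled_dupe]
kept = df[~cancelled_dupe]
print(kept)
print(deleted)
```

The cancelled test1WO1 row stays because its key occurs only once, which matches the expected result in the question.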
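Some explanation:
For Step 1:
It simply concatenates the two columns with a lambda. When you use apply with axis=1, the function is run on every row. The result goes into the new column 'Column_A_B'.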
#Step1
# definition of the function called by the lambda (other ways to do this exist)
def process(r):
    return r['Column_A'] + r['Column_B']  # i concatenate the 2 values

df["Column_A_B"] = df.apply(lambda row: process(row), axis=1)
print("step 1"); print(df)
Result:
step 1
  Column_A Column_B     Status   Qty Column_A_B
0    test1      WO1  Cancelled    12   test1WO1
1    test7      WO7  Cancelled    34   test7WO7
2    test7      WO7     Active    13   test7WO7
3    test4      WO6     Active  3000   test4WO6
4    test6      WO6       Open    14   test6WO6
5    test6      WO6     Active    88   test6WO6
6    test7      WO7     Active  1500   test7WO7
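One of the "other ways" for Step 1 is plain column arithmetic, which avoids the row-by-row apply. This is a small sketch on a reduced sample; the variable alt is only there to show that the string accessor gives the same result.

```python
import pandas as pd

df = pd.DataFrame({
    'Column_A': ['test1', 'test7', 'test4'],
    'Column_B': ['WO1', 'WO7', 'WO6'],
})

# Element-wise string concatenation of the two columns, no apply needed
df['Column_A_B'] = df['Column_A'].astype(str) + df['Column_B'].astype(str)

# Same result through the string accessor
alt = df['Column_A'].str.cat(df['Column_B'])
print(df)
```

On large frames the vectorized form is usually noticeably faster than apply with axis=1, since it does not call a Python function once per row.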
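For Step 5:
The idea is to create a new column that holds the maximum Qty of each group (here the groups are the values of Column_A_B). So after running this command: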
df['maxAB'] = df.groupby('Column_A_B')['Qty'].transform('max')
print("maxAB");print(df)
Result:
maxAB
  Column_A Column_B     Status   Qty Column_A_B  maxAB
0    test1      WO1  Cancelled    12   test1WO1     12   *max value of group test1WO1
2    test7      WO7     Active    13   test7WO7   1500   *max value of group test7WO7
3    test4      WO6     Active  3000   test4WO6   3000   *max value of group test4WO6
4    test6      WO6       Open    14   test6WO6     88   *max value of group test6WO6
5    test6      WO6     Active    88   test6WO6     88   *max value of group test6WO6
6    test7      WO7     Active  1500   test7WO7   1500   *max value of group test7WO7
As you can see, every row now carries the maximum value of its group (sorry for my English).
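To make the difference with a plain aggregation visible, here is a small sketch on a reduced sample. The names per_group and per_row are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Column_A_B': ['test7WO7', 'test6WO6', 'test6WO6', 'test7WO7'],
    'Qty': [13, 14, 88, 1500],
})

# Aggregation: one value per group, indexed by the group key
per_group = df.groupby('Column_A_B')['Qty'].max()

# Transform: the group result is broadcast back, one value per original row
per_row = df.groupby('Column_A_B')['Qty'].transform('max')

print(per_group)
print(per_row)
```

Because the transformed Series keeps the index of df, it can be compared directly with df['Qty'] in a boolean filter.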
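Now, for each group whose maximum Qty is > 99 and which also contains rows with Qty <= 16, I delete only the rows with Qty <= 16.
So the next command selects every row that matches this filter, i.e. the rows to remove from the DataFrame: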
todelete = df[(df['maxAB'] > 99) & (df['Qty'] <= 16)]
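So todelete holds the rows I want to remove, and df must keep all the other rows. For that we have to use the opposite filter.
In logic terms, the opposite of (A and B) is not (A and B) = (not A) or (not B).
So the negation of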
df[(df['maxAB'] > 99) & (df['Qty'] <= 16)]
is:
df[(df['maxAB'] <= 99) | (df['Qty'] > 16)]
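The equivalence of the two filters can be checked directly on boolean Series. This is only a sketch with made-up values for the maxAB and Qty columns.

```python
import pandas as pd

df = pd.DataFrame({'maxAB': [12, 1500, 88, 1500], 'Qty': [12, 13, 14, 1500]})

a = df['maxAB'] > 99
b = df['Qty'] <= 16

drop_mask = a & b                                    # rows to delete
keep_mask = (df['maxAB'] <= 99) | (df['Qty'] > 16)   # (not A) or (not B)

# De Morgan: not (A and B) == (not A) or (not B), element by element
assert (keep_mask == ~drop_mask).all()
print(df[keep_mask])
```

One caveat: if either column contained NaN, both `> 99` and `<= 99` would be False for that row, so the hand-written opposite would no longer match `~drop_mask`. Negating the mask with `~` is the safer form in that case.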
After this command:
# i want to keep rows whose group has a max Qty <= 99
# or
# rows which have a Qty > 16
df = df[(df['maxAB'] <= 99) | (df['Qty'] > 16)]
You can simplify this by storing the filter in a variable:
mask = (df['maxAB'] > 99) & (df['Qty'] <= 16)
todelete = df[mask]
df = df[~mask]
~mask is the element-wise equivalent of not mask
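The same idea can be wrapped in a small helper that returns both halves at once. This is a sketch under assumptions: split_by_mask and max_ab are hypothetical names introduced here, and the sample is the state of the data after Step 4 with Qty already converted to int.

```python
import pandas as pd


def split_by_mask(frame, mask):
    """Return (kept, removed): rows where mask is False, and rows where it is True."""
    removed = frame[mask].copy()
    kept = frame[~mask].reset_index(drop=True)
    return kept, removed


df = pd.DataFrame({
    'Column_A_B': ['test1WO1', 'test7WO7', 'test4WO6', 'test6WO6', 'test6WO6', 'test7WO7'],
    'Qty': [12, 13, 3000, 14, 88, 1500],
})

# Group maximum kept in a separate Series, so no temporary column has to be dropped later
max_ab = df.groupby('Column_A_B')['Qty'].transform('max')

kept, removed = split_by_mask(df, (max_ab > 99) & (df['Qty'] <= 16))
print(kept)
print(removed)
```

Keeping the group maximum outside the frame means neither result needs a follow-up drop of maxAB.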
I rebuild the index (0 to 4):
df = df.reset_index(drop=True)
Finally, you get the expected result (after dropping the temporary columns).
Hope this helps you understand...