My goal is to combine two columns into a third column, 'Priority' (step 1). Next, I count each instance of the combined values in this new 'Priority' column (step 2). I then filter to the instances where the combined value (i.e. 'Priority') has a count of 1 (step 3). Next, wherever the count of the combined value created in step 2 is greater than 1 (step 4), I delete every row whose 'WO_Stat' column says Cancelled.
I believe the steps up to that point are done correctly; in my code I marked where I got lost with the comment: "#above, this was working fine 9.24 but not sure if it makes sense, also need to work on below".
It is the steps below that comment where I need the most help.
Step 5: For the values in 'Priority' whose count is above 1, delete the rows with an 'Order_Qty' under 16, but only when the same 'Priority' value also has another 'Order_Qty' greater than 99. (Keep in mind each 'Priority' value can have up to 10 counts, so if the Order_Qtys were say 10, 10, 9, 8, 2000 you might be deleting only 4 of the rows, while for 2000, 2000, 4000, 3000, 300 you would delete nothing.)
Even if you can't help with the logic, help just making this code run faster would be welcome; it takes nearly an hour to work through 40k rows of data. Maybe I could bring in dynamic programming, or format the column data types better?
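To make the step 5 rule concrete, here is a rough sketch of what I am after (illustration only, not my working code; it assumes the step 1 concat already lives in 'Priority' and that 'Order_Qty' casts cleanly to float):

qty = df['Order_Qty'].astype(float)
dupe = df.groupby('Priority')['Priority'].transform('count') > 1   # the step 2 count is above 1
has_big = qty.gt(99).groupby(df['Priority']).transform('any')      # group holds an Order_Qty over 99
df = df[~(dupe & has_big & qty.lt(16))]                            # drop only the small rows in such groups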
import pandas as pd
import numpy as np
from numpy import NaN
df = pd.read_excel("ors_final.xlsx", encoding = "ISO-8859-1", dtype=object) #used to read xls file named vlookuped but now changed to ors_final as of 2.20.19
df['Priority']= df['Priority'].astype('str')
df['Cust_PO_Number']= df['Cust_PO_Number'].astype('str')
df['Item_Number']= df['Item_Number'].astype('str')
df['Sub_Priority']= df['Sub_Priority'].astype('str')
# creating second df
df_deleted = df.copy(deep=True)
df_deleted.drop(df.index,inplace=True)
# creating variable for small value first art
LOWER_THRESHOLD = 16
#
print("1. combine po number and item number")
for i, row in df.iterrows(): # iterate through each row, with the row index and row contents
    a = str(row['Cust_PO_Number'])
    b = str(row['Item_Number'])
    concat = a + b
    df.set_value(i, 'Priority', concat)
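# (note: set_value is deprecated in newer pandas; df.at[i, 'Priority'] = concat is the modern spelling)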
#worked 9.23
print('2. Count all the duplicates of the combined values above')
seen = {}
for i, row in df.iterrows(): # count the combined values (a dict, since keys can't repeat)
    c = row['Priority']
    if c not in seen: # have not seen this value before, so establish it
        seen[c] = 0
    seen[c] += 1 # seen the concatted value once more, add one
for i, row in df.iterrows(): # write the recorded counts back: loop through each row and use its value of c as the dict key
    c = row['Priority']
    times_seen = seen[c]
    df.set_value(i, 'Mfg_Co', times_seen)
print("3. Ignore instances of rowes where concat is not one")
for i, row in df.iterrows():
d = row['Mfg_Co']
if d == 1.0:
df.set_value(i,'Sub_Priority',True)
else:
df.set_value(i,'Sub_Priority',False)
print('4. Delete all rows where the order is cancelled but the concatted count is more than 1')
delete_these = []
for i, row in df.iterrows():
    f = row['WO_Stat']
    d = row['Sub_Priority']
    if str(f) == 'Cancelled' and d != True:
        delete_these.append(i)
        df_deleted = df_deleted.append(row) # this does not append the dataframe yet, looking into it 9.23
df.drop(delete_these, axis=0, inplace=True)
#above this was working 9.24 but I had not tested the data integrity; looked pretty good though
over_numbers = {}
for i, row in df.iterrows(): # record which Priority groups have an Order_Qty over 99; still working out kinks 9.24
    c = row['Priority']
    g = row['Order_Qty']
    if float(g) > float(99):
        over_numbers[c] = True
# a little confused on the part below
print('step 5')
for i, row in df.iterrows(): # flag the rows whose group had a number over 99
    c = row['Priority']
    if c in over_numbers:
        df.set_value(i, 'Comments_Status', True)
    else:
        df.set_value(i, 'Comments_Status', False)
#above, this was working fine 9.24 but not sure if it makes sense, also need to work on below
##
delete_these = []
for i, row in df.iterrows(): # remove all rows that have over_number == True and also a number less than 16
    d = row['Sub_Priority'] # should this be changed?
    f = row['Comments_Status']
    if d <= LOWER_THRESHOLD and f is True: # so grouping 1st arts
        delete_these.append(i) # store row index to drop later
        df_deleted = df_deleted.append(row) # add the row to the other dataframe
df.drop(delete_these, axis=0, inplace=True)
#step 5 was not working as of 10.2, it was breaking out the first article data wrong
writer = pd.ExcelWriter('1start.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
writer = pd.ExcelWriter('deleted1start.xlsx', engine='xlsxwriter')
df_deleted.to_excel(writer, sheet_name='Sheet1')
writer.save()
--- New format of the question, trying to make it easier to understand/help ---
import pandas as pd
df = pd.DataFrame({'Column_A': ['test1', 'test7', 'test7', 'test4', 'test6', 'test6', 'test7'],
                   'Column_B': ['WO1', 'WO7', 'WO7', 'WO6', 'WO6', 'WO6', 'WO7'],
                   'Column_A_B': ['', '', '', '', '', '', ''],
                   'Status': ['Cancelled', 'Cancelled', 'Active', 'Active', 'Open', 'Active', 'Active'],
                   'Qty': ['12', '34', '13', '3000', '14', '88', '1500']})
Please see the sample dataframe above and my step-by-step goals:

Step 1: Combine Column_A and Column_B into Column_A_B.
Step 2: Count each instance of a value in Column_A_B.
Step 3: Filter out the rows whose Column_A_B value has only 1 instance.
Step 4: Delete every row whose 'Status' column says Cancelled, and only those rows; Column_A_B can hold the same value across rows whose 'Status' values differ (note that the step-3 filter is still applied).
Step 5: With the filter on Column_A_B still in place (i.e. counts of 1 filtered out), look at the redundant values (where the count in Column_A_B is 2 or greater) and compare the 'Qty' column within each group. If the group has a Qty under 16 and also one over 99, delete only the rows with Qty under 16. If the group's Qtys are all under 99, delete nothing; if all of the Qty values are over 99, delete nothing either.

The resulting df from this logic would be (see the sketch after it for one way to get there):
import pandas as pd
goaldf = pd.DataFrame({'Column_A': ['test1', 'test4', 'test6', 'test6', 'test7'],
                       'Column_B': ['WO1', 'WO6', 'WO6', 'WO6', 'WO7'],
                       'Column_A_B': ['test1WO1', 'test4WO6', 'test6WO6', 'test6WO6', 'test7WO7'],
                       'Status': ['Cancelled', 'Active', 'Open', 'Active', 'Active'],
                       'Qty': ['12', '3000', '14', '88', '1500']})
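For reference, a minimal vectorized sketch that reproduces goaldf from the sample df above (it assumes Qty is cast to float for the comparisons; the column names simply mirror the sample):

df['Column_A_B'] = df['Column_A'] + df['Column_B']                  # step 1
counts = df.groupby('Column_A_B')['Column_A_B'].transform('count')  # step 2
df = df[~((counts > 1) & (df['Status'] == 'Cancelled'))]            # steps 3-4: drop Cancelled only in duplicated groups
counts = df.groupby('Column_A_B')['Column_A_B'].transform('count')  # recount after the drop
qty = df['Qty'].astype(float)
has_big = qty.gt(99).groupby(df['Column_A_B']).transform('any')     # group holds a Qty over 99
df = df[~((counts > 1) & has_big & qty.lt(16))]                     # step 5: drop only the small rows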
Answer (score: 1)
I second @PeterLeimbigler's comment, but I'd suggest a few overall improvements to your code. I recommend using iterrows only when absolutely necessary; personally, I've found it far slower than the standard pandas ways of doing business. See some changes below.
#To concat two columns into one as a string type
df["NewCol"] = df["Col1"].astype(str) + df["Col2"].astype(str) # assigns the concatenated values to the new column instead of iterating over each row; much faster this way
# To add a count column giving, per row, how many times that row's NewCol value appears in the whole dataframe
df['Counts'] = df.groupby(['NewCol'])['NewCol'].transform('count') # the count ignores NaN values
# If your intent is just to get a duplicate count based on both columns, keep your data as ints and do this
df['Counts'] = df.groupby(['col1', 'col2'])['coltocount'].transform('count')
# Alternate method to count values
countcol1 = df['Col1'].value_counts()
counts = countcol1.to_dict() # converts the counts to a dict
df['Count'] = df['Col1'].map(counts)
# To get true false values based on a specific column's data
df["Truethiness"] = (df["ColToCompare"] == 1.0) # This can be multiple conditions if need be.
# To conditionally drop rows from a pandas dataframe
df = df.drop(df[<some condition>].index)
# If you need to save the data from the conditional drop
df2 = df.drop(df[<Alternate condition of above>].index)
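One way to pair those last two snippets, with a hypothetical condition, so that the dropped rows are kept for auditing the way your df_deleted frame does:

mask = df['Qty'].astype(float) < 16   # hypothetical condition; swap in your own
df2 = df[mask].copy()                 # the rows being removed
df = df.drop(df[mask].index)          # the same frame without them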