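I'm currently working on a problem that involves looking at a large number of purchased parts and determining whether we have succeeded in reducing costs.
I've run into a snag, though. Our buyers can enter an order in any unit of measure (UOM), but they don't always remember to enter the conversion factor, so we sometimes end up with data like the dataframe below: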
import pandas as pd

df = pd.DataFrame(
    [
        ['AABBCCDD','2014/2015','Q2',31737.60],
        ['AABBCCDD','2014/2015','Q2',31737.60],
        ['AABBCCDD','2014/2015','Q2',31737.60],
        ['AABBCCDD','2014/2015','Q3',89060.84],
        ['AABBCCDD','2015/2016','Q3',71586.00],
        ['AABBCCDD','2016/2017','Q3',89060.82],
        ['AABBCCDD','2017/2018','Q3',98564.40],
        ['AABBCCDD','2017/2018','Q3',110691.24],
        ['AABBCCDD','2017/2018','Q4',93390.00],
        ['AABBCCDD','2018/2019','Q2',90420.00],
        ['AABBCCDD','2018/2019','Q3',13.08],
        ['AABBCCDD','2018/2019','Q3',13.08]
    ],
    columns=['PART_NO','FiscalYear','FiscalQuarter','Price'])
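As you can see, the unit cost drops drastically for the last two purchases. That's because we used to buy the material as one whole sheet, and the buyer has now chosen to enter the order in square inches of material.
The correct fix, of course, is to go to the buyer and ask him/her to correct the entry. I'd like to find out about these issues beforehand.
I have tried pivoting the data: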
df_tab = pd.pivot_table(df, values='Price', index=['PART_NO'], columns=['FiscalYear','FiscalQuarter'], aggfunc='mean')
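which gives the following result:
Naturally, I have thousands of parts going into this dataframe, with one row per part number. The real data will probably be by date rather than by quarter, so the above is simplified for illustration.
How can I handle the following two cases?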
-------------EDIT--------------
Combining the suggestions below with some other inspiration, I arrived at the following solution:
# Imports
import pyodbc
import urllib.parse
from sql import SQL
import pandas as pd
from sqlalchemy import create_engine
# Set variables
upperQuantile = 0.8
lowerQuantile = 0.2
# Connect to server / database
params = urllib.parse.quote_plus("Driver={SQL Server Native Client 11.0};Server=LT02670;Database=staging;Trusted_Connection=yes;")
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
# Create dataframe containing raw data
df = pd.read_sql(SQL(), engine)
# define lower and upper quantile functions for outlier detection
# (agg() uses the function names q1 and q2 as the resulting column names)
def q1(x):
    return x.quantile(lowerQuantile)
def q2(x):
    return x.quantile(upperQuantile)
# define function for sorting out outliers
f = {'PO_UNIT_PRICE_CURRENT_CURRENCY': ['median', 'std', q1,q2]}
# group data and add function to data (adds columns median, std, q1 and q2)
dfgrp = df.groupby(['PART_NO']).agg(f).reset_index()
# Isolate part numbers in dataframe
dfgrpPart = pd.DataFrame(dfgrp['PART_NO'])
# Isolate value columns in dataframe
dfgrpStat = dfgrp['PO_UNIT_PRICE_CURRENT_CURRENCY']
# Join categorical data with values (this is done in order to eliminate the multiindex caused by the groupby function)
dfgrp = dfgrpPart.join(dfgrpStat)
# Add new columns to raw data extract
df = df.join(dfgrp.set_index('PART_NO'), on='PART_NO').reset_index()
# Remove outliers and 0-values
idx = df[df['PO_UNIT_PRICE_CURRENT_CURRENCY'] < df['q1']].index
df.drop(idx, inplace=True)
idx = df[df['PO_UNIT_PRICE_CURRENT_CURRENCY'] > df['q2']].index
df.drop(idx, inplace=True)
idx = df[df['PO_UNIT_PRICE_CURRENT_CURRENCY'] <= 0].index
df.drop(idx, inplace=True)
# Split dataframe into fiscal year chunks, and build lists of part numbers
df_14_15 = df[df['FiscalYear'].str.match('2014/2015')]['PART_NO'].to_list()
# df_15_16 = df[df['FiscalYear'].str.match('2015/2016')]['PART_NO'].to_list()
df_16_17 = df[df['FiscalYear'].str.match('2016/2017')]['PART_NO'].to_list()
# df_17_18 = df[df['FiscalYear'].str.match('2017/2018')]['PART_NO'].to_list()
df_18_19 = df[df['FiscalYear'].str.match('2018/2019')]['PART_NO'].to_list()
df_19_20 = df[df['FiscalYear'].str.match('2019/2020')]['PART_NO'].to_list()
# create one list of unique part numbers from multiple years, i have chosen only some years, as we rarely order the same parts six years running
partsList = list(set(df_14_15) & set(df_16_17) & set(df_18_19))
# Use list of part numbers to filter out raw data into output dataframe
dfAllYears = df[df['PART_NO'].isin(partsList)]
# write data to excel file for further analysis, this will overwrite existing file so be careful
dfAllYears.to_excel("output.xlsx", index=False, sheet_name='Data')
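This allowed me to do my analysis and move on.
I'm not entirely happy with the code, though. I suspect I'm making things needlessly complicated and not taking full advantage of what pandas can do.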
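For illustration only, here is one way the quantile filter and the "parts ordered in every selected year" step above might be condensed. It is a sketch, not a drop-in replacement: the small dataframe is made up to stand in for the SQL extract, and `required_years`, `kept` and `df_all_years` are names chosen for this example. The column names follow the code above.

```python
import pandas as pd

# Made-up stand-in for the SQL extract; column names follow the question.
df = pd.DataFrame({
    'PART_NO': ['A'] * 5 + ['B'] * 3,
    'FiscalYear': ['2014/2015', '2016/2017', '2016/2017', '2018/2019', '2018/2019',
                   '2014/2015', '2016/2017', '2016/2017'],
    'PO_UNIT_PRICE_CURRENT_CURRENCY': [100.0, 99.0, 0.5, 101.0, 102.0,
                                       20.0, 21.0, 22.0],
})

price = 'PO_UNIT_PRICE_CURRENT_CURRENCY'
lower_quantile, upper_quantile = 0.2, 0.8

# Per-part quantile bounds, broadcast back onto every row (no join needed).
grouped = df.groupby('PART_NO')[price]
lower = grouped.transform(lambda s: s.quantile(lower_quantile))
upper = grouped.transform(lambda s: s.quantile(upper_quantile))

# Keep rows inside the bounds (inclusive) that also have a positive price.
kept = df[df[price].between(lower, upper) & (df[price] > 0)]

# Keep only parts that still have rows in every required fiscal year.
required_years = ['2014/2015', '2016/2017', '2018/2019']
in_required = kept[kept['FiscalYear'].isin(required_years)]
n_years = in_required.groupby('PART_NO')['FiscalYear'].nunique()
parts_list = n_years[n_years == len(required_years)].index

df_all_years = kept[kept['PART_NO'].isin(parts_list)]
print(df_all_years)
```

`transform` returns a result aligned with the original rows, which is what removes the need for the separate aggregate, flatten and join steps. Note that a part with only a handful of orders can lose most of its rows to a 0.2/0.8 cut, as part B does here.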
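Answer 0 (score: 1)
To judge properly whether something is an outlier, you would need to add some statistics to the mix. That goes beyond what you need here, though.
I'd suggest simply sorting by price and looking at the values at the top of the dataframe (an ascending sort puts the unusually low prices first).
You can do that like this: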
df = df.sort_values('Price').reset_index()
To replace those values with null, just note their index positions, select all the Price values in that range, and set them to None.
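As a rough sketch of that idea, assuming a small made-up frame and a cut-off picked by eye after inspecting the sorted rows (`n_suspicious` is a hypothetical name, not something from the question):

```python
import pandas as pd

# Toy prices for one part; the two 13.08 entries mimic the UOM mix-up.
df = pd.DataFrame({
    'PART_NO': ['AABBCCDD'] * 5,
    'Price': [31737.60, 89060.84, 13.08, 90420.00, 13.08],
})

# Ascending sort puts the suspiciously low prices in the first rows.
df = df.sort_values('Price').reset_index(drop=True)

# Assumed cut-off: after looking at the sorted frame, the first two rows look wrong.
n_suspicious = 2
df.loc[df.index[:n_suspicious], 'Price'] = None  # shows up as NaN in a float column

print(df)
```

This relies on a manual look at the data, so it suits a one-off check better than a frame with thousands of parts.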
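Answer 1 (score: 1)
One approach is to filter out the extreme values of the Price column, here the outermost 10% (below the 5% quantile or above the 95% quantile). By changing low and high you can move the bounds for what counts as extreme. Values outside the bounds end up as NaN, and the rows containing NaN can then be pulled out as a separate DataFrame of outliers.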
from scipy import stats
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        ['AABBCCDD', '2014/2015', 'Q2', 31737.60],
        ['AABBCCDD', '2014/2015', 'Q2', 31737.60],
        ['AABBCCDD', '2014/2015', 'Q2', 31737.60],
        ['AABBCCDD', '2014/2015', 'Q3', 89060.84],
        ['AABBCCDD', '2015/2016', 'Q3', 71586.00],
        ['AABBCCDD', '2016/2017', 'Q3', 89060.82],
        ['AABBCCDD', '2017/2018', 'Q3', 98564.40],
        ['AABBCCDD', '2017/2018', 'Q3', 110691.24],
        ['AABBCCDD', '2017/2018', 'Q4', 93390.00],
        ['AABBCCDD', '2018/2019', 'Q2', 90420.00],
        ['AABBCCDD', '2018/2019', 'Q3', 13.08],
        ['AABBCCDD', '2018/2019', 'Q3', 13.08]
    ],
    columns=['PART_NO', 'FiscalYear', 'FiscalQuarter', 'Price'])
filt_df = df.loc[:, df.columns == 'Price']
low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)
filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)
filt_df = pd.concat([df.loc[:, 'PART_NO'], filt_df], axis=1)
filt_df = pd.concat([df.loc[:, 'FiscalYear'], filt_df], axis=1)
filt_df = pd.concat([df.loc[:, 'FiscalQuarter'], filt_df], axis=1)
Outliers = filt_df[filt_df.isnull().any(axis=1)]
print(Outliers)
Output:
FiscalQuarter FiscalYear PART_NO Price
7 Q3 2017/2018 AABBCCDD NaN
10 Q3 2018/2019 AABBCCDD NaN
11 Q3 2018/2019 AABBCCDD NaN
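In this case I'm not sure whether flagging index 7 is right or wrong. You can choose any bounds you like, though, as long as they lie between 0 and 1. Then look at the filtered DataFrame and see which values stand out most.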
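A variant of the same quantile idea, sketched with a boolean mask so that the flagged rows keep their original prices instead of showing NaN. The comparison is written as `<=` / `>=` so that it flags exactly the rows the strict filter above drops. The names `lower_bound`, `upper_bound` and `is_outlier` are only illustrative.

```python
import pandas as pd

prices = [31737.60, 31737.60, 31737.60, 89060.84, 71586.00, 89060.82,
          98564.40, 110691.24, 93390.00, 90420.00, 13.08, 13.08]
df = pd.DataFrame({'PART_NO': 'AABBCCDD', 'Price': prices})

low, high = 0.05, 0.95
# quantile() with a list returns a two-element Series, unpacked here in order.
lower_bound, upper_bound = df['Price'].quantile([low, high])

# True where the price lies on or outside the quantile bounds.
is_outlier = (df['Price'] <= lower_bound) | (df['Price'] >= upper_bound)
outliers = df[is_outlier]

print(outliers)
```

On the sample data this flags the same rows as the output above (index 7, 10 and 11), but with the offending prices still visible for follow-up with the buyer.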
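Answer 2 (score: 1)
I think comparing each price of a PART_NO with that part's average price would show the problem quite easily (assuming prices don't normally fluctuate much).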
import pandas as pd
df = pd.DataFrame(
    [
        ['AABBCCDD', '2014/2015', 'Q2', 31737.60],
        ['AABBCCDD', '2014/2015', 'Q2', 31737.60],
        ['AABBCCDD', '2014/2015', 'Q2', 31737.60],
        ['AABBCCDD', '2014/2015', 'Q3', 89060.84],
        ['AABBCCDD', '2015/2016', 'Q3', 71586.00],
        ['AABBCCDD', '2016/2017', 'Q3', 89060.82],
        ['AABBCCDD', '2017/2018', 'Q3', 98564.40],
        ['AABBCCDD', '2017/2018', 'Q3', 110691.24],
        ['AABBCCDD', '2017/2018', 'Q4', 93390.00],
        ['AABBCCDD', '2018/2019', 'Q2', 90420.00],
        ['AABBCCDD', '2018/2019', 'Q3', 13.08],
        ['AABBCCDD', '2018/2019', 'Q3', 13.08]
    ],
    columns=['PART_NO', 'FiscalYear', 'FiscalQuarter', 'Price'])
# average price per part, as a frame with columns PART_NO and AVG_PRICE
avg_df = df.groupby('PART_NO')['Price'].mean().to_frame().reset_index().rename(columns={'Price': 'AVG_PRICE'})
df = df.merge(avg_df)
df['ratio'] = df['AVG_PRICE']/df['Price']
Output:
PART_NO FiscalYear FiscalQuarter Price AVG_PRICE ratio
0 AABBCCDD 2014/2015 Q2 31737.60 61501.021667 1.937797
1 AABBCCDD 2014/2015 Q2 31737.60 61501.021667 1.937797
2 AABBCCDD 2014/2015 Q2 31737.60 61501.021667 1.937797
3 AABBCCDD 2014/2015 Q3 89060.84 61501.021667 0.690551
4 AABBCCDD 2015/2016 Q3 71586.00 61501.021667 0.859121
5 AABBCCDD 2016/2017 Q3 89060.82 61501.021667 0.690551
6 AABBCCDD 2017/2018 Q3 98564.40 61501.021667 0.623968
7 AABBCCDD 2017/2018 Q3 110691.24 61501.021667 0.555609
8 AABBCCDD 2017/2018 Q4 93390.00 61501.021667 0.658540
9 AABBCCDD 2018/2019 Q2 90420.00 61501.021667 0.680171
10 AABBCCDD 2018/2019 Q3 13.08 61501.021667 4701.912971
11 AABBCCDD 2018/2019 Q3 13.08 61501.021667 4701.912971
The ratio is huge for the outliers. If you filter on df.ratio > 5, or whatever threshold you decide on, you'll get all the records you're after.
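A minimal sketch of that last filtering step, using `groupby(...).transform('mean')` in place of the merge to attach the per-part average. The threshold of 5 is the arbitrary value mentioned above, and `suspects` is just a name picked for this example.

```python
import pandas as pd

prices = [31737.60, 31737.60, 31737.60, 89060.84, 71586.00, 89060.82,
          98564.40, 110691.24, 93390.00, 90420.00, 13.08, 13.08]
df = pd.DataFrame({'PART_NO': 'AABBCCDD', 'Price': prices})

# Per-part mean, broadcast to every row (plays the role of AVG_PRICE above).
df['AVG_PRICE'] = df.groupby('PART_NO')['Price'].transform('mean')
df['ratio'] = df['AVG_PRICE'] / df['Price']

# Arbitrary threshold; tune it to how much prices normally move.
threshold = 5
suspects = df[df['ratio'] > threshold]

print(suspects)
```

Because the ratio is average over price, this only catches prices that are far too low. Prices that are far too high would need a second check on the inverse ratio.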