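I want to write a program that runs through multiple columns of data and creates a new data frame based on the outliers and blank values it finds. Currently, I have the following code, which replaces values with "Outlier" and "No Data", but I am struggling to turn this into a new data frame.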
import pandas as pd
from pandas import ExcelWriter
# Remove Initial Data Quality
outl = ['.',0,' ']
# Pull in Data
path = r"C:\Users\robert.carmody\desktop\Python\PyTest\PyTGPS.xlsx"
sheet = 'Raw Data'
df = pd.read_excel(path,sheet_name=sheet)
data = pd.read_excel(path,sheet_name=sheet)
j = 0
while j < len(df.keys()): #run through total number of columns
    list(df.iloc[:,j]) #create a list of all values within the 'j' column
    if type(list(df.iloc[:,j])[0]) == float:
        q1 = df.iloc[:,j].quantile(q=.25)
        med = df.iloc[:,j].quantile(q=.50)
        q3 = df.iloc[:,j].quantile(q=.75)
        iqr = q3 - q1
        ub = q3 + 1.5*iqr
        lb = q1 - 1.5*iqr
        mylist = [] #outlier list is defined
        for i in df.iloc[:,j]: #identify outliers and add to the list
            if i > ub or i < lb:
                mylist.append(float(i))
            else:
                i
        if mylist == []:
            mylist = ['Outlier']
        else:
            mylist
    else:
        mylist = ['Outlier']
    data.iloc[:,j].replace(mylist,'Outlier',inplace=True)
    j = j + 1
data = data.fillna('No Data')
#Excel
path2 = r"C:\Users\robert.carmody\desktop\Python\PyTest\PyTGPS.xlsx"
# ExcelWriter.save() was removed in pandas 2.0; the context manager saves on exit
with ExcelWriter(path2) as writer:
    df.to_excel(writer, sheet_name='Raw Data')
    data.to_excel(writer, sheet_name='Adjusted Data')
Answer 0 (score: 0)
Suppose your data looks like this, with the upper bound set to 2 and the lower bound set to 0 for simplicity:
df = pd.DataFrame({'group':'A B C D E F'.split(' '), 'Q1':[1,1,5,2,2,2], 'Q2':[1,5,5,2,2,2],'Q3':[2,2,None,2,2,2]})
df.set_index('group', inplace=True)
That is:
       Q1  Q2   Q3
group
A       1   1  2.0
B       1   5  2.0
C       5   5  NaN
D       2   2  2.0
E       2   2  2.0
F       2   2  2.0
Then the following may give you what you want:
newData = []
for quest in df.columns: #run through the columns
    q1 = df[quest].quantile(q=.25)
    med = df[quest].quantile(q=.50)
    q3 = df[quest].quantile(q=.75)
    iqr = q3 - q1
    #ub = q3 + 1.5*iqr
    ub = 2 #simplified upper bound for this example
    #lb = q1 - 1.5*iqr
    lb = 0 #simplified lower bound for this example
    for group in df.index:
        i = df.loc[group, quest]
        if i > ub or i < lb: #identify outliers and add to the list
            newData += [[group, quest, 'Outlier', i]]
        elif pd.isna(i): #missing value (NaN fails both comparisons above)
            newData += [[group, quest, 'None', None]]
This builds a two-dimensional list, which can easily be converted into a data frame with
pd.DataFrame(newData)
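If you want named columns and the real 1.5*IQR bounds from the question rather than the fixed 0 and 2 used above, the same idea can be sketched with boolean masks instead of a cell-by-cell loop. This is only a rough sketch: the function name `flag_outliers_and_missing`, the output column names, and the "No Data" label are my own choices for illustration, and it assumes only numeric columns should be checked.

```python
import pandas as pd


def flag_outliers_and_missing(df, k=1.5):
    """Return one row per outlier or missing cell (hypothetical helper)."""
    records = []
    # Assumption: only numeric columns are checked for outliers
    for col in df.select_dtypes(include="number").columns:
        s = df[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)  # quantile() skips NaN
        iqr = q3 - q1
        lb, ub = q1 - k * iqr, q3 + k * iqr
        # Cells outside [lb, ub] are outliers; NaN compares False, so it is skipped here
        for idx, val in s[(s < lb) | (s > ub)].items():
            records.append((idx, col, "Outlier", val))
        # Missing cells are recorded separately
        for idx in s[s.isna()].index:
            records.append((idx, col, "No Data", float("nan")))
    return pd.DataFrame(records, columns=["group", "column", "flag", "value"])


df = pd.DataFrame(
    {"Q1": [1, 1, 5, 2, 2, 2], "Q2": [1, 5, 5, 2, 2, 2], "Q3": [2, 2, None, 2, 2, 2]},
    index=pd.Index(list("ABCDEF"), name="group"),
)
flags = flag_outliers_and_missing(df)
print(flags)
```

With the sample data, the IQR bounds flag only the 5 in Q1 for group C as an outlier (the wider spread in Q2 keeps its 5s inside the bounds), and the missing Q3 value for group C is reported as "No Data".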