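I am trying to run a one-way ANOVA on the three plant-growth groups (ctrl, trt1, trt2) in this dataset: http://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv. I am using a combination of Pandas and SciPy. However, the F and p values I get after column-wise z-score normalization of the data are identical to those I get without normalization! Can anyone tell me why this happens?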
import pandas as pd
from scipy import stats
# Load the PlantGrowth data (the code below uses its 'weight' and 'group' columns)
datafile = "../data/PlantGrowth.csv"
data = pd.read_csv(datafile)
# Column-wise z-score: subtract the mean first, then divide by the population std
data['weight_zscore'] = (data['weight'] - data['weight'].mean()) / data['weight'].std(ddof=0)
# Split the raw and normalized weights by treatment group
grps = pd.unique(data.group.values)
weight_data = {grp: data['weight'][data.group == grp] for grp in grps}
weight_zscore_data = {grp: data['weight_zscore'][data.group == grp] for grp in grps}
# One-way ANOVA on the raw weights and on the normalized weights
F, p = stats.f_oneway(weight_data['ctrl'], weight_data['trt1'], weight_data['trt2'])
Fz, pz = stats.f_oneway(weight_zscore_data['ctrl'], weight_zscore_data['trt1'], weight_zscore_data['trt2'])
print("Non-normalized weight:", F, p)
print("Normalized weight:", Fz, pz)
The output is:
Non-normalized weight: 4.84608786238, 0.0159099583256
Normalized weight: 4.84608786238, 0.0159099583256
Answer 0 (score: 1)
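I think it is because normalization is an invertible (one-to-one) transformation applied uniformly to the whole dataset, so it does not affect the outcome of the statistical test. For example, if you are testing a difference in means, subtracting 5 from every value shifts each group mean by the same amount and leaves the differences between them unchanged. Likewise, dividing the whole dataset by a constant rescales the between-group and within-group variation by the same factor, which cancels in the F ratio, so the p-value and any other score computed from it are unaffected.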
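To see this concretely, here is a minimal sketch that compares the ANOVA result before and after a global z-score. The three arrays are made-up values for illustration only, not the PlantGrowth data, and the variable names are my own.

```python
import numpy as np
from scipy import stats

# Made-up sample values for three hypothetical groups (NOT the PlantGrowth data).
groups = [
    np.array([4.2, 5.1, 4.8, 5.5, 4.9]),
    np.array([3.9, 4.4, 4.1, 5.0, 4.3]),
    np.array([5.6, 5.2, 6.1, 5.8, 5.4]),
]

# Global z-score: one mean and one std computed over all values,
# so every group is shifted and scaled by the same constants.
pooled = np.concatenate(groups)
mu, sigma = pooled.mean(), pooled.std(ddof=0)
z_groups = [(g - mu) / sigma for g in groups]

F_raw, p_raw = stats.f_oneway(*groups)
F_z, p_z = stats.f_oneway(*z_groups)

# The shift cancels in every deviation from a mean, and the scale factor
# appears squared in both the numerator and denominator of F.
print("raw:       ", F_raw, p_raw)
print("normalized:", F_z, p_z)
print("same F and p:", np.isclose(F_raw, F_z) and np.isclose(p_raw, p_z))
```

The two results should agree up to floating-point rounding. Note this only holds because the same mean and standard deviation are used for every group.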