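I am working on an assignment for the Data Analysis Tools course on Coursera, and I'm having trouble with my code. The assignment is to run an analysis of variance (ANOVA) to compare group means. Using the NESARC study, I am trying to test the hypothesis that a higher number of alcohol abuse episodes is associated with a family history of alcoholism.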
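My quantitative response variable is S2BQ3B, the number of episodes of alcohol abuse (1-99). My explanatory variable is 'FAMHIST', which I built by adding S2DQ1 + S2DQ2, since together they should capture a father and mother who were each reported as "yes" for alcohol abuse.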
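To make the construction of `FAMHIST` concrete, here is a minimal sketch on made-up rows rather than the real NESARC file. It assumes the codes 1 = yes, 2 = no and 9 = unknown for both parent questions; only the 1 and 9 codes appear in my code below, so the 2 is an assumption. The helper name `build_famhist` is hypothetical.

```python
import numpy as np
import pandas as pd

def build_famhist(df):
    """Sum the father (S2DQ1) and mother (S2DQ2) codes into FAMHIST.

    Assumed coding: 1 = yes, 2 = no, 9 = unknown (treated as missing).
    """
    out = df.copy()
    for col in ("S2DQ1", "S2DQ2"):
        out[col] = pd.to_numeric(out[col], errors="coerce").replace(9, np.nan)
    # Under the assumed coding: 2 = both parents yes,
    # 3 = exactly one parent yes, 4 = neither parent yes
    out["FAMHIST"] = out["S2DQ1"] + out["S2DQ2"]
    return out

# Made-up rows, not NESARC data
toy = pd.DataFrame({"S2DQ1": [1, 1, 2, 2, 9], "S2DQ2": [1, 2, 1, 2, 1]})
famhist = build_famhist(toy)["FAMHIST"]
print(famhist.tolist())
```

Under that coding the sum separates "both parents", "one parent" and "neither parent", and a row with an unknown answer ends up missing.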
When I run my test and print the OLS summary, I get inf for the F-statistic and nan for the p-value. I have added .dropna() to my dataset, but that doesn't seem to change the result.
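One check that may help narrow this down is counting how many distinct groups the explanatory variable has before fitting. A one-way ANOVA with k groups has k - 1 between-group degrees of freedom, so with a single group there is nothing to compare and the F-statistic is not defined. The sketch below uses made-up rows and the same kind of filter as my code; the helper name `count_groups` is hypothetical, the 1 = yes, 2 = no coding is an assumption, and I have not confirmed that this is what causes my output.

```python
import pandas as pd

def count_groups(df, group_col):
    """Return the number of distinct non-missing levels in group_col."""
    return df[group_col].dropna().nunique()

# Made-up rows, not NESARC data (assumed coding: 1 = yes, 2 = no)
toy = pd.DataFrame({
    "S2BQ3B": [1, 3, 2, 5, 4, 2],
    "S2DQ1":  [1, 1, 2, 2, 1, 2],
    "S2DQ2":  [1, 2, 1, 2, 1, 2],
})
toy["FAMHIST"] = toy["S2DQ1"] + toy["S2DQ2"]

# Same kind of filter as in my code: keep only rows where both parents are coded 1
both_yes = toy[(toy["S2DQ1"] == 1) & (toy["S2DQ2"] == 1)]

print("levels before filter:", count_groups(toy, "FAMHIST"))
print("levels after filter:", count_groups(both_yes, "FAMHIST"))
print(both_yes["FAMHIST"].value_counts())
```

If the count after filtering is 1, `C(FAMHIST)` has no second group to compare against.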
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
data = pd.read_csv('nesarc_pds.csv', low_memory=False)
#Setting variables to numeric
data['S2BQ3B'] = pd.to_numeric(data['S2BQ3B'], errors='coerce')
data['S2AQ1'] = pd.to_numeric(data['S2AQ1'])
data['S2DQ1'] = pd.to_numeric(data['S2DQ1'])
data['S2DQ2'] = pd.to_numeric(data['S2DQ2'])
# Subset data to exclude anyone who has never drunk in their lifetime, or who has no alcoholic episodes in their family history
sub1=data[(data['S2BQ3B']<=99) & (data['S2DQ1']==1) & (data['S2DQ2']==1)]
sub2=sub1.copy()
sub2['S2BQ3B']=sub2['S2BQ3B'].replace(99,np.nan) # NUMBER OF EPISODES OF ALCOHOL ABUSE
sub2['S2DQ1']=sub2['S2DQ1'].replace(9,np.nan) # BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
sub2['S2DQ2']=sub2['S2DQ2'].replace(9,np.nan) # BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
sub2['FAMHIST']=sub2['S2DQ1'] + sub2['S2DQ2']
sub2['FAMHIST']=pd.to_numeric(sub2['FAMHIST'])
sub3=sub2.dropna()
# Using ols function for calculating the F-statistic and associated p value
# OLS - ordinary least squares
model1 = smf.ols(formula='S2BQ3B ~ C(FAMHIST)', data=sub3).fit()
print(model1.summary())
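A note on the `dropna()` call above: called with no arguments, `DataFrame.dropna()` removes every row that has a missing value in any column, not only in the model variables. The sketch below, on made-up rows, restricts the drop to the columns used in the formula through the `subset` parameter. The column `OTHER` is a hypothetical stand-in for the many unrelated survey columns.

```python
import numpy as np
import pandas as pd

# Made-up rows, not NESARC data; OTHER stands in for unrelated columns
toy = pd.DataFrame({
    "S2BQ3B":  [1.0, 3.0, np.nan, 5.0],
    "FAMHIST": [2.0, 3.0, 4.0, 2.0],
    "OTHER":   [np.nan, 7.0, 8.0, np.nan],
})

# Drops any row with a missing value in any column
all_cols = toy.dropna()

# Drops only rows that are missing one of the model variables
model_cols = toy.dropna(subset=["S2BQ3B", "FAMHIST"])

print(len(all_cols), len(model_cols))
```

Comparing the two row counts shows how much data the unrestricted call discards for reasons unrelated to the model.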
The OLS report output is attached for reference. Any help would be greatly appreciated!