Question

〜我做了什么〜

在本作业的第一部分中，我必须获取有关数学，科学和阅读分数的数据（从此处：[链接] https://en.wikipedia.org/wiki/Programme_for_International_Student_Assessment_(2000_to_2012)），并将它们分别放在三个带有国家名称和分数的图表中

然后，我不得不按国家/地区名称组合图表，并找到三个分数的平均值。然后，我必须对它们进行排名，并按排名对其进行排序。

〜我需要做什么〜

接下来，我需要创建一个函数来查找不同列字符串标题（平均值，数学得分，科学得分和阅读得分）的异常值（大于标准偏差的1.8倍），并在这些得分中找到异常值，并在该标题下打印异常值所在的国家/地区。我需要输入一个字符串（平均，数学分数，科学分数，阅读分数）并获取国家/地区名称列表。

我试图适应先前的任务，在该任务中，我们必须在一组数据中定位异常值，然后将其从数据中删除。那个人有很多用于求平均值和标准差的数学运算，因此我在这里尝试简化它。在这一篇中，我只需要找出异常值发生的位置。

tempDF是使用组合图表制作的临时数据框。

tempDF

   def find_outlier(str):
         outliers = []
         found = True
         while found:

             for i in range(len(str)):   

                 mean = (tempDF[str].mean())
                 std = (tempDF[str].std())
                 distance = abs((str[i]-mean)/std)

                 if distance > 1.8:             
                     outliers = outliers.append()   
                     print("The outliers in ", str, " are ", outliers)

                     found=True
                     break
                 found = False

      find_outlier("Average")
      find_outlier("Math Score")
      find_outlier("Science Score")
      find_outlier("Reading Score")

对于每个类别（平均值，数学，科学，阅读），应打印“平均值的异常值是['China Shanghai，China，'，'Qatar'，'Peru']”等。

我现在得到的错误是“ TypeError：无法使用灵活类型执行归约”。我想的是，列中的数字可能是字符串，而不是数字。

Answer 1

tempDF = pd.DataFrame({'country': ['A']*1000+['B'], 'Income' : [10]*1000+[1000]})

def find_outlier(df, col):
    return df[abs((df[col]-df[col].mean())/df[col].std())>1.8]['country'].values
    # OR
    #return df[np.abs((df[col]-np.mean(df[col]))/np.std(df[col]))>1.8]['country'].values

print ("The outliers in {0} are {1}".format("Income", find_outlier(tempDF, "Income")))

输出：

The outliers in Income are ['B']

如何在有关测试分数的列数据中识别异常值并返回异常值的国家/地区名称

1 个答案: