How do I calculate the 1st and 3rd quartiles?

Time: 2017-08-28 19:38:23

Tags: python python-2.7 pandas numpy

I have a DataFrame:

    time_diff   avg_trips
0   0.450000    1.0
1   0.483333    1.0
2   0.500000    1.0
3   0.516667    1.0
4   0.533333    2.0

I want to get the 1st quartile, the 3rd quartile and the median of the column time_diff. To get the median, I use np.median(df["time_diff"].values).

How can I calculate the quartiles?

12 answers:

Answer 0 (score: 19)

You can use np.percentile to calculate the quartiles (including the median):

>>> np.percentile(df.time_diff, 25)  # Q1
0.48333300000000001

>>> np.percentile(df.time_diff, 50)  # median
0.5

>>> np.percentile(df.time_diff, 75)  # Q3
0.51666699999999999

Or all at once:

>>> np.percentile(df.time_diff, [25, 50, 75])
array([ 0.483333,  0.5     ,  0.516667])
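For a cross-check without NumPy, the standard library offers a comparable calculation. The sketch below assumes Python 3.8 or later (where statistics.quantiles was added) and copies the time_diff values from the question into a plain list. With method="inclusive" it interpolates linearly between data points in the same way as the np.percentile default.

```python
import statistics

# time_diff values copied from the question's DataFrame
time_diff = [0.450000, 0.483333, 0.500000, 0.516667, 0.533333]

# n=4 asks for the three cut points that split the data into quarters
q1, median, q3 = statistics.quantiles(time_diff, n=4, method="inclusive")

print(q1, median, q3)
```

Because the five values sit exactly on the 0%, 25%, 50%, 75% and 100% positions, the three cut points are simply the second, third and fourth values.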

Answer 1 (score: 17)

Use ifelse(d>40, "Red", "Black")

ifelse

Answer 2 (score: 9)

Coincidentally, this information is captured by the describe method:

df.time_diff.describe()

count    5.000000
mean     0.496667
std      0.032059
min      0.450000
25%      0.483333
50%      0.500000
75%      0.516667
max      0.533333
Name: time_diff, dtype: float64

Answer 3 (score: 5)

Using pandas:

df.time_diff.quantile([0.25,0.5,0.75])


Out[793]: 
0.25    0.483333
0.50    0.500000
0.75    0.516667
Name: time_diff, dtype: float64

From an answer to "How do you find the IQR in Numpy?"

Answer 4 (score: 3)

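Building on, or rather slightly correcting, what Cyrus said...

np.percentile very much does calculate the values of Q1, the median and Q3. Consider the sorted list below: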

s1=[18,45,66,70,76,83,88,90,90,95,95,98]

Running np.percentile(s1, [25, 50, 75]) returns linearly interpolated values:

[69.   85.5  91.25]

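However, if the quartiles are taken as the medians of the lower and upper halves of the list, they are Q1 = 68.0, median = 85.5 and Q3 = 92.5, which are the values one would expect.

What we are missing here is the interpolation parameter of np.percentile and related functions. By default, the value of this parameter is linear. This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j, whichever is nearest.
midpoint: (i + j) / 2.

So running np.percentile(s1, [25, 50, 75], interpolation='midpoint') returns the expected result for the list: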

[68.  85.5 92.5]
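To make the five options concrete, here is a pure-Python sketch of what each one computes. The helper name percentile is hypothetical, it expects an already sorted list, and the tie-breaking for "nearest" at a fraction of exactly 0.5 is an assumption rather than NumPy's documented behaviour.

```python
import math

def percentile(sorted_data, q, mode="linear"):
    """Hypothetical helper: q-th percentile (0-100) of an already sorted list."""
    pos = (len(sorted_data) - 1) * q / 100.0    # fractional index into the data
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_data) - 1)
    i, j = sorted_data[lo], sorted_data[hi]     # the two surrounding data points
    fraction = pos - lo
    if fraction == 0:
        return float(i)                         # the index hits a data point exactly
    if mode == "linear":
        return i + (j - i) * fraction
    if mode == "lower":
        return float(i)
    if mode == "higher":
        return float(j)
    if mode == "nearest":
        return float(i if fraction < 0.5 else j)  # assumption: ties go to j
    if mode == "midpoint":
        return (i + j) / 2.0
    raise ValueError("unknown mode: " + mode)

s1 = [18, 45, 66, 70, 76, 83, 88, 90, 90, 95, 95, 98]
linear = [percentile(s1, q) for q in (25, 50, 75)]
midpoint = [percentile(s1, q, "midpoint") for q in (25, 50, 75)]
print(linear)    # [69.0, 85.5, 91.25]
print(midpoint)  # [68.0, 85.5, 92.5]
```

Note that from NumPy 1.22 onwards the keyword is called method, and interpolation is deprecated, so on recent versions the call would be np.percentile(s1, [25, 50, 75], method='midpoint').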

Answer 5 (score: 2)

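np.percentile computes values for Q1, the median and Q3, but with its default settings they can differ from the quartiles you would work out by hand. Consider the sorted list below: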

samples = [1, 1, 8, 12, 13, 13, 14, 16, 19, 22, 27, 28, 31]

Running np.percentile(samples, [25, 50, 75]) returns actual values from the list:

Out[1]: array([12., 14., 22.])

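However, the quartiles are Q1=10.0, Median=14, Q3=24.5 (you can also check the quartiles and median with an online calculator). The code below computes the quartiles and median of a sorted list; because of the sorting step, this approach takes O(n log n) time, where n is the number of items. Alternatively, the quartiles and median can be found in O(n) time with the median of medians selection algorithm (order statistics).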

samples = sorted([28, 12, 8, 27, 16, 31, 14, 13, 19, 1, 1, 22, 13])

def find_median(sorted_list):
    """Return the median of a sorted list and the index or indices it came from."""
    list_size = len(sorted_list)
    mid = list_size // 2

    if list_size % 2 == 0:
        # Even length: average the two middle elements.
        indices = [mid - 1, mid]
        median = (sorted_list[indices[0]] + sorted_list[indices[1]]) / 2.0
    else:
        # Odd length: the middle element is the median.
        indices = [mid]
        median = sorted_list[mid]

    return median, indices

median, median_indices = find_median(samples)
# The lower half ends before the last median index and the upper half starts
# after the first one, so an odd-length list drops its median and an
# even-length list is split evenly.
Q1, Q1_indices = find_median(samples[:median_indices[-1]])
Q3, Q3_indices = find_median(samples[median_indices[0] + 1:])

quartiles = [Q1, median, Q3]

print("(Q1, median, Q3): {}".format(quartiles))
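The O(n) remark refers to selection algorithms. As a rough illustration, here is a randomized quickselect, a simpler relative of median of medians that runs in O(n) on average rather than in the worst case. The helper name kth_smallest is hypothetical and not part of any library.

```python
import random

def kth_smallest(values, k):
    """Return the k-th smallest item (0-based) without fully sorting."""
    items = list(values)                        # work on a copy
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(smaller):
            items = smaller                     # answer lies below the pivot
        elif k < len(smaller) + len(equal):
            return pivot                        # the pivot itself is the answer
        else:
            k -= len(smaller) + len(equal)      # skip everything <= pivot
            items = [x for x in items if x > pivot]

samples = [28, 12, 8, 27, 16, 31, 14, 13, 19, 1, 1, 22, 13]
n = len(samples)                                # 13 items, so the median is item 6

median = kth_smallest(samples, n // 2)
# Q1 and Q3 average the two middle items of each six-element half
q1 = (kth_smallest(samples, 2) + kth_smallest(samples, 3)) / 2.0
q3 = (kth_smallest(samples, 9) + kth_smallest(samples, 10)) / 2.0
print(q1, median, q3)   # 10.0 14 24.5
```

The list is never sorted here; each call only partitions around pivots until the requested rank is found.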

Answer 6 (score: 2)

You can use

df.describe()

which will display this information.

Answer 7 (score: 2)

If you want to use plain Python rather than numpy or pandas, you can use Python's statistics module to find the medians of the lower and upper halves of the list:

    >>> import statistics as stat
    >>> def quartile(data):
            data = sorted(data)
            half_list = len(data) // 2
            lower_quartile = stat.median(data[:half_list])
            upper_quartile = stat.median(data[-half_list:])
            print("Lower Quartile: " + str(lower_quartile))
            print("Upper Quartile: " + str(upper_quartile))
            print("Interquartile Range: " + str(upper_quartile - lower_quartile))

    >>> quartile(df.time_diff)

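Line 1: import the statistics module under the alias "stat"

Line 2: define the quartile function

Line 3: sort the data in ascending order

Line 4: get half the length of the list

Line 5: get the median of the lower half of the list

Line 6: get the median of the upper half of the list

Line 7: print the lower quartile

Line 8: print the upper quartile

Line 9: print the interquartile range

Line 10: run the quartile function on the time_diff column of the DataFrame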
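To see what this approach gives for the question's data, here is a variant that returns the values instead of printing them. It is only a sketch: the function name quartiles_by_halves is hypothetical, and the time_diff values are copied from the question into a plain list so that pandas is not needed.

```python
import statistics

def quartiles_by_halves(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2      # an odd-length list leaves its median out of both halves
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[-half:])
    return lower, upper

time_diff = [0.450000, 0.483333, 0.500000, 0.516667, 0.533333]
q1, q3 = quartiles_by_halves(time_diff)
print(q1, q3, q3 - q1)
```

With five values each half holds two items, so Q1 is the mean of 0.45 and 0.483333 and Q3 is the mean of 0.516667 and 0.533333. These differ from the np.percentile results (0.483333 and 0.516667) because the two approaches use different quartile conventions.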

Answer 8 (score: 0)

I wrote this while learning object-oriented programming alongside statistics; maybe you will find it useful:

samplesCourse = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21, 23, 28, 30, 33, 34, 36, 44, 45, 47, 60]

class sampleSet:
    def __init__(self, sampleList):
        self.sampleList = sampleList
        self.interList = sorted(sampleList) # interList is a sorted copy of sampleList; the copy keeps the original sampleList intact

    def find_median(self):
        self.median = 0

        if len(self.sampleList) % 2 == 0:
            # find median for even-numbered sample list length
            self.medL = self.interList[int(len(self.interList)/2)-1]
            self.medU = self.interList[int(len(self.interList)/2)]
            self.median = (self.medL + self.medU)/2

        else:
            # find median for odd-numbered sample list length
            self.median = self.interList[int((len(self.interList)-1)/2)]
        return self.median

    def find_1stQuartile(self, median):
        self.lower50List = []
        self.Q1 = 0

        # break out the lower 50 percent of the samples; for an odd-length
        # list the slice stops just before the median, so nothing is popped
        self.lower50List = self.interList[:len(self.interList) // 2]

        # find 1st quartile (median of lower 50 percentiles)
        if len(self.lower50List) % 2 == 0:
            self.Q1L = self.lower50List[int(len(self.lower50List)/2)-1]
            self.Q1U = self.lower50List[int(len(self.lower50List)/2)]
            self.Q1 = (self.Q1L + self.Q1U)/2

        else:
            self.Q1 = self.lower50List[int((len(self.lower50List)-1)/2)]

        return self.Q1

    def find_3rdQuartile(self, median):
        self.upper50List = []
        self.Q3 = 0

        # break out the upper 50 percent of the samples; for an odd-length
        # list the slice starts just after the median, so nothing is popped
        self.upper50List = self.interList[(len(self.interList) + 1) // 2:]

        # find 3rd quartile (median of upper 50 percentiles)
        if len(self.upper50List) % 2 == 0:
            self.Q3L = self.upper50List[int(len(self.upper50List)/2)-1]
            self.Q3U = self.upper50List[int(len(self.upper50List)/2)]
            self.Q3 = (self.Q3L + self.Q3U)/2

        else:
            self.Q3 = self.upper50List[int((len(self.upper50List)-1)/2)]

        return self.Q3

    def find_InterQuartileRange(self, Q1, Q3):
        self.IQR = self.Q3 - self.Q1
        return self.IQR

    def find_UpperFence(self, Q3, IQR):
        self.fence = self.Q3 + 1.5 * self.IQR
        return self.fence

samples = sampleSet(samplesCourse)
median = samples.find_median()
firstQ = samples.find_1stQuartile(median)
thirdQ = samples.find_3rdQuartile(median)
iqr = samples.find_InterQuartileRange(firstQ, thirdQ)
fence = samples.find_UpperFence(thirdQ, iqr)

print("Median is: ", median)
print("1st quartile is: ", firstQ)
print("3rd quartile is: ", thirdQ)
print("IQR is: ", iqr)
print("Upper fence is: ", fence)

Answer 9 (score: 0)

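I ran into a similar problem when trying to find a package that computes quartiles. That is not to say the others are wrong, but this is how I personally would define quartiles. It gives results similar to Shikar's midpoint approach, but it also works on odd-length lists. If a quartile position falls between two elements, it uses the mean of the adjacent values (i.e. a position is always treated as either the exact position or the position plus 0.5).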

import math

def find_quartile_postions(size):
    if size == 1:
        # All quartiles are the first (only) element
        return 0, 0, 0
    elif size == 2:
        # Lower quartile is first element, Upper quartile is second element, Median is average
        # Set to 0.5, 0.5, 0.5 if you prefer all quartiles to be the mean value
        return 0, 0.5, 1
    else:
        # Lower quartile is element at 1/4th position, median at 1/2th, upper at 3/4
        # Quartiles can be between positions if size + 1 is not divisible by 4
        return (size + 1) / 4 - 1, (size + 1) / 2 - 1, 3 * (size + 1) / 4 - 1

def find_quartiles(num_array):
    size = len(num_array)
    
    if size == 0:
        quartiles = [0,0,0]
    else:
        sorted_array = sorted(num_array)
        lower_pos, median_pos, upper_pos = find_quartile_postions(size)

        # Floor so can work in arrays
        floored_lower_pos = math.floor(lower_pos)
        floored_median_pos = math.floor(median_pos)
        floored_upper_pos = math.floor(upper_pos)

        # If position is an integer, the quartile is the elem at position
        # else the quartile is the mean of the elem & the elem one position above
        lower_quartile = (sorted_array[floored_lower_pos]
                          if (lower_pos % 1 == 0)
                          else (sorted_array[floored_lower_pos] + sorted_array[floored_lower_pos + 1]) / 2
                         )

        median = (sorted_array[floored_median_pos]
                          if (median_pos % 1 == 0)
                          else (sorted_array[floored_median_pos] + sorted_array[floored_median_pos + 1]) / 2
                         )

        upper_quartile = (sorted_array[floored_upper_pos]
                          if (upper_pos % 1 == 0)
                          else (sorted_array[floored_upper_pos] + sorted_array[floored_upper_pos + 1]) / 2
                         )

        quartiles = [lower_quartile, median, upper_quartile]

    return quartiles
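As a quick check of the claim that this matches the midpoint results, the position rule can be restated in condensed form and run on the two sorted lists used in the earlier answers. This is only a sketch: quartiles_half_step is a hypothetical name, and the special cases for lists shorter than three items are left out.

```python
import math

def quartiles_half_step(values):
    """Condensed restatement of the position rule for lists of 3 or more items."""
    data = sorted(values)
    size = len(data)
    result = []
    for k in (1, 2, 3):
        pos = k * (size + 1) / 4.0 - 1      # 0-based, possibly fractional position
        low = int(math.floor(pos))
        if pos == low:
            result.append(data[low])        # exact position: take the element
        else:
            # in-between position: average the two neighbouring elements
            result.append((data[low] + data[low + 1]) / 2.0)
    return result

print(quartiles_half_step([18, 45, 66, 70, 76, 83, 88, 90, 90, 95, 95, 98]))   # [68.0, 85.5, 92.5]
print(quartiles_half_step([1, 1, 8, 12, 13, 13, 14, 16, 19, 22, 27, 28, 31]))  # [10.0, 14, 24.5]
```

Both results agree with the hand-computed quartiles quoted in the earlier answers, for the even-length and the odd-length list alike.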

Answer 10 (score: 0)

Try this:

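dfo = sorted(df.time_diff)

n = len(dfo)

# 1-based positions of the quartiles in the sorted data
Q1 = int((n + 3) / 4)
Q3 = int((3 * n + 1) / 4)

print("Q1 position: ", Q1, "Q3 position: ", Q3)

# subtract 1 because Python lists are 0-indexed
print("Q1 value: ", dfo[Q1 - 1], "Q3 value: ", dfo[Q3 - 1])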

Answer 11 (score: 0)

If you are interested in using JS, I have developed a solution:

var
withThis = (obj, cb) => cb(obj),
sort = array => array.sort((a, b) => a - b),

fractile = (array, parts, nth) => withThis(
  (nth * (array.length + 1) / parts),
  decimal => withThis(Math.floor(decimal),
    even => withThis(sort(array),
      sorted => sorted[even - 1] + (
        (decimal - even) * (
          sorted[even] - sorted[even - 1]
        )
      )
    )
  )
),

data = [
  78, 72, 74, 79, 74, 71, 75, 74, 72, 68,
  72, 73, 72, 74, 75, 74, 73, 74, 65, 72,
  66, 75, 80, 69, 82, 73, 74, 72, 79, 71,
  70, 75, 71, 70, 70, 70, 75, 76, 77, 67
]

fractile(data, 4, 1) // 1st Quartile is 71
fractile(data, 10, 3) // 3rd Decile is 71.3
fractile(data, 100, 82) // 82nd Percentile is 75.62

You can simply copy and paste the code into your browser to get the exact results. For more on "statistics with JS", see https://gist.github.com/rikyperdana/a7349c790cf5b034a1b77db64415e73c/edit
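Since the rest of this thread is in Python, here is a rough Python port of the same fractile formula. It is a sketch only: like the JS version, it does not guard against positions that fall before the first or after the last element.

```python
import math

def fractile(values, parts, nth):
    """Python port of the JS fractile: the nth cut point out of `parts`."""
    data = sorted(values)                              # sorted copy; the input is not mutated
    position = nth * (len(data) + 1) / float(parts)    # 1-based, possibly fractional
    whole = int(math.floor(position))
    below, above = data[whole - 1], data[whole]        # neighbours around the position
    return below + (position - whole) * (above - below)

data = [
    78, 72, 74, 79, 74, 71, 75, 74, 72, 68,
    72, 73, 72, 74, 75, 74, 73, 74, 65, 72,
    66, 75, 80, 69, 82, 73, 74, 72, 79, 71,
    70, 75, 71, 70, 70, 70, 75, 76, 77, 67,
]

print(fractile(data, 4, 1))      # 1st quartile: 71.0
print(fractile(data, 10, 3))     # 3rd decile, approximately 71.3
print(fractile(data, 100, 82))   # 82nd percentile, approximately 75.62
```

Unlike the JS version, which sorts the array in place, this port works on a sorted copy and leaves the caller's list unchanged.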