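How to calculate conditional probability of values in a pandas DataFrame in Python?

Time: 2016-06-14 16:59:24

Tags: python pandas dataframe probability

I want to calculate the conditional probabilities of the ratings ('A', 'B', 'C') in the rating column.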

    company     model    rating   type
0   ford       mustang     A      coupe
1   chevy      camaro      B      coupe
2   ford       fiesta      C      sedan
3   ford       focus       A      sedan
4   ford       taurus      B      sedan
5   toyota     camry       B      sedan

Output:

Prob(rating=A) = 0.333333 
Prob(rating=B) = 0.500000 
Prob(rating=C) = 0.166667 

Prob(type=coupe|rating=A) = 0.500000 
Prob(type=sedan|rating=A) = 0.500000 
Prob(type=coupe|rating=B) = 0.333333 
Prob(type=sedan|rating=B) = 0.666667 
Prob(type=coupe|rating=C) = 0.000000 
Prob(type=sedan|rating=C) = 1.000000 

Any help is appreciated, thanks!

5 answers:

Answer 0 (score: 8)

You can use .groupby() and the built-in .div():

rating_probs = df.groupby('rating').size().div(len(df))

rating
A    0.333333
B    0.500000
C    0.166667

and the conditional probabilities:

df.groupby(['type', 'rating']).size().div(len(df)).div(rating_probs, axis=0, level='rating')

coupe  A         0.500000
       B         0.333333
sedan  A         0.500000
       B         0.666667
       C         1.000000

Answer 1 (score: 3)

You need to add reindex so that the missing pairs are filled with 0 values.

Another solution, thanks to Zero:

mux = pd.MultiIndex.from_product([df['rating'].unique(), df['type'].unique()])
s = (df.groupby(['rating', 'type']).count() / df.groupby('rating').count())['model']
s = s.reindex(mux, fill_value=0)
print (s)
A  coupe    0.500000
   sedan    0.500000
B  coupe    0.333333
   sedan    0.666667
C  coupe    0.000000
   sedan    1.000000
Name: model, dtype: float64
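As an alternative way to fill in the missing pairs, the counts can be pivoted into a table with `unstack(fill_value=0)` before normalizing. This is only a sketch under the assumption that the question's `rating` and `type` columns are available; the names `counts`, `table` and `cond_table` are made up for illustration and it is not necessarily what the answer originally showed.

```python
import pandas as pd

# Only the two columns needed here, copied from the question's table.
df = pd.DataFrame({
    "rating": ["A", "B", "C", "A", "B", "B"],
    "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"],
})

# Count each (rating, type) pair that occurs in the data.
counts = df.groupby(["rating", "type"]).size()

# Pivot 'type' into columns; pairs that never occur become 0 counts.
table = counts.unstack(fill_value=0)

# Divide each row by its total to get P(type | rating).
cond_table = table.div(table.sum(axis=1), axis=0)
print(cond_table)
```

Each row of `cond_table` is one rating, so `cond_table.loc["C", "coupe"]` is the zero entry that a plain groupby would leave out.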

Answer 2 (score: 2)

You can use groupby:

In [2]: df = pd.DataFrame({'company': ['ford', 'chevy', 'ford', 'ford', 'ford', 'toyota'],
                     'model': ['mustang', 'camaro', 'fiesta', 'focus', 'taurus', 'camry'],
                     'rating': ['A', 'B', 'C', 'A', 'B', 'B'],
                     'type': ['coupe', 'coupe', 'sedan', 'sedan', 'sedan', 'sedan']})

In [3]: df.groupby('rating').count()['model'] / len(df)
Out[3]:
rating
A    0.333333
B    0.500000
C    0.166667
Name: model, dtype: float64

In [4]: (df.groupby(['rating', 'type']).count() / df.groupby('rating').count())['model']
Out[4]:
rating  type
A       coupe    0.500000
        sedan    0.500000
B       coupe    0.333333
        sedan    0.666667
C       sedan    1.000000
Name: model, dtype: float64

Answer 3 (score: 0)

First, convert the data to a pandas DataFrame. That way, you can take advantage of pandas' groupby method.

collection = {"company": ["ford", "chevy", "ford", "ford", "ford", "toyota"],
              "model": ["mustang", "camaro", "fiesta", "focus", "taurus", "camry"],
              "rating": ["A", "B", "C", "A", "B", "B"],
              "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"]}

df = pd.DataFrame(collection)

Then, group by the event (i.e., the rating).
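The grouping step could be sketched as follows. This is an assumption about what the step might look like, not the answer's own code: it uses `value_counts(normalize=True)` within each rating group, and the name `cond` is illustrative.

```python
import pandas as pd

collection = {"company": ["ford", "chevy", "ford", "ford", "ford", "toyota"],
              "model": ["mustang", "camaro", "fiesta", "focus", "taurus", "camry"],
              "rating": ["A", "B", "C", "A", "B", "B"],
              "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"]}

df = pd.DataFrame(collection)

# Within each rating group, the normalized counts of 'type' are P(type | rating).
cond = df.groupby("rating")["type"].value_counts(normalize=True)
print(cond)
```

The result is indexed by (rating, type). Combinations that do not occur in the data, such as (C, coupe), are simply absent rather than reported as 0.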

Answer 4 (score: 0)

pd.crosstab(df.type, df.rating, margins=True, normalize="index")

rating         A    B         C
type
coupe   0.500000  0.5  0.000000
sedan   0.250000  0.5  0.250000
All     0.333333  0.5  0.166667

Here the All row gives the probabilities of A, B and C. Now for the conditional probabilities:

pd.crosstab(df.type, df.rating, margins=True, normalize="columns")

rating    A         B    C       All
type
coupe   0.5  0.333333  0.0  0.333333
sedan   0.5  0.666667  1.0  0.666667

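Here the conditional probabilities are in the table. For example, the conditional probability that the type is coupe given a rating of A is in the coupe row and the A column, which is 0.5: Prob(type=coupe|rating=A) = 0.5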
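To read a single value out of the column-normalized table programmatically, something like the following sketch should work, assuming the question's data is rebuilt as `df`. The names `cond` and `p` are illustrative; `margins` is left out because only the per-rating columns are needed.

```python
import pandas as pd

df = pd.DataFrame({
    "rating": ["A", "B", "C", "A", "B", "B"],
    "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"],
})

# Column-normalized crosstab: each rating column sums to 1,
# so each cell is P(type | rating).
cond = pd.crosstab(df["type"], df["rating"], normalize="columns")

# Look up one conditional probability by row (type) and column (rating).
p = cond.loc["coupe", "A"]
print(f"Prob(type=coupe|rating=A) = {p:.6f}")
```

Because crosstab builds the full type-by-rating grid, pairs that never occur (such as coupe with rating C) show up as 0.0 instead of being dropped.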