How do I find the unique combinations in a one-hot encoded DataFrame?

Time: 2019-02-19 20:01:31

Tags: pandas apriori

I have a DataFrame named test that looks like this:

+-------+---------+---------+---------+------------+
|       | Term 1  | Term 2  | Term 3  | Final Exam |
+-------+---------+---------+---------+------------+
| 1288  |      0  |      0  |      1  |          1 |
| 1290  |      1  |      1  |      1  |          1 |
| 1294  |      0  |      0  |      1  |          1 |
| 1296  |      1  |      1  |      1  |          1 |
| 1297  |      1  |      1  |      1  |          1 |
| 1304  |      0  |      1  |      1  |          1 |
| 1308  |      0  |      0  |      1  |          1 |
| 1324  |      1  |      1  |      1  |          1 |
| 1325  |      1  |      1  |      1  |          1 |
| 1332  |      1  |      1  |      1  |          1 |
+-------+---------+---------+---------+------------+

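I want a summary table listing each unique combination of columns that equal 1 in a row, along with the number of rows in which that exact combination occurs: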

+------------------------------------+-----------+
|            Combination             | Frequency |
+------------------------------------+-----------+
| Term 3, Final Exam                 |         3 |
| Term 2, Term 3, Final Exam         |         1 |
| Term 1, Term 2, Term 3, Final Exam |         6 |
+------------------------------------+-----------+

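I tried mlxtend's apriori, but it gives me every subset of columns that occur together, rather than only the exact combination found in each row: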

from mlxtend.frequent_patterns import apriori

# mine every itemset that appears in at least one row
results = apriori(test, min_support=0.00001, use_colnames=True)
results['length'] = results['itemsets'].apply(len)
numberofcases = test.shape[0]
# support is a fraction of rows; scale it back to a row count
results['Frequency'] = results['support'] * numberofcases
# join the items of each frozenset into a comma-separated string
results['Terms'] = results['itemsets'].apply(', '.join)
results[results['length'] > 1][['Terms', 'Frequency']]

Result:

+-----+-------------------------------------+-----------+
|     |               Terms                 | Frequency |
+-----+-------------------------------------+-----------+
|  4  | Term 2, Term 1                      |       6.0 |
|  5  | Term 3, Term 1                      |       6.0 |
|  6  | Final Exam, Term 1                  |       6.0 |
|  7  | Term 2, Term 3                      |       7.0 |
|  8  | Term 2, Final Exam                  |       7.0 |
|  9  | Term 3, Final Exam                  |      10.0 |
| 10  | Term 2, Term 3, Term 1              |       6.0 |
| 11  | Term 2, Final Exam, Term 1          |       6.0 |
| 12  | Term 3, Final Exam, Term 1          |       6.0 |
| 13  | Term 2, Term 3, Final Exam          |       7.0 |
| 14  | Term 2, Term 3, Final Exam, Term 1  |       6.0 |
+-----+-------------------------------------+-----------+

Is there a parameter in apriori that would produce the desired result?

1 Answer:

Answer 0: (score: 2)

Use dot together with value_counts (df here is the one-hot DataFrame, called test in the question):

df.dot(df.columns+',').str[:-1].value_counts()
Out[419]: 
Term1,Term2,Term3,FinalExam    6
Term3,FinalExam                3
Term2,Term3,FinalExam          1
dtype: int64
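
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the question's table and applies the same dot plus value_counts idea. The names rows, labels and summary are mine, not from the original post, and I join with ", " (and strip two trailing characters) only so the labels match the spacing in the desired table. The trick relies on 0 * "text" giving an empty string and 1 * "text" giving the text itself, so the dot product concatenates the names of the columns that are 1 in each row.

```python
import pandas as pd

# Rebuild the sample data from the question (index = row id).
cols = ["Term 1", "Term 2", "Term 3", "Final Exam"]
rows = {
    1288: [0, 0, 1, 1],
    1290: [1, 1, 1, 1],
    1294: [0, 0, 1, 1],
    1296: [1, 1, 1, 1],
    1297: [1, 1, 1, 1],
    1304: [0, 1, 1, 1],
    1308: [0, 0, 1, 1],
    1324: [1, 1, 1, 1],
    1325: [1, 1, 1, 1],
    1332: [1, 1, 1, 1],
}
test = pd.DataFrame.from_dict(rows, orient="index", columns=cols)

# One label per row: the names of the columns that are 1, joined by ", ".
# .str[:-2] drops the trailing ", " left by the last column name.
labels = test.dot(test.columns + ", ").str[:-2]

# Count identical labels and shape the result like the desired table.
summary = (
    labels.value_counts()
    .rename_axis("Combination")
    .reset_index(name="Frequency")
)
print(summary)
```

On the sample data this should give 6, 3 and 1 for the three combinations, which add up to the 10 rows. As far as I know, apriori cannot produce this table through a parameter, because support counts every row that contains an itemset as a subset, not only the rows that match it exactly.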