I want to apply the Apriori algorithm to a retail dataset (market-basket data from a retail store). The data has the following form:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32
33 34 35
36 37 38 39 40 41 42 43 44 45 46
38 39 47 48
38 39 48 49 50 51 52 53 54 55 56 57 58
32 41 59 60 61 62
3 39 48
So, to use the Apriori algorithm, I need to convert the data from Python lists into a one-hot encoded NumPy array, like this:
Column names as 0 1 2 3 4 5 6 7 8 9 10 ...
and the dataset as:
0 1 2 3 4 5 6 7 8 9 10 .........30 31 32 33 34 35....
1 1 1 1 1 1 1 1 1 1 1...........0 0 0 0 0 0...
0 0 0 0 0 0 0 0 0 0 0...........1 1 1 0 0 0..
and so on..
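(For context, a hand-rolled version of that target matrix for the first three baskets might look like the minimal NumPy sketch below; the basket contents and matrix size are taken from the sample above purely for illustration:)
import numpy as np

# Hypothetical example: first three baskets from the sample above
baskets = [
    list(range(0, 30)),   # transaction 0: items 0..29
    [30, 31, 32],         # transaction 1
    [33, 34, 35],         # transaction 2
]

n_items = max(max(b) for b in baskets) + 1
one_hot = np.zeros((len(baskets), n_items), dtype=int)
for row, basket in enumerate(baskets):
    one_hot[row, basket] = 1   # mark purchased items with 1

print(one_hot)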
For this, I tried using TransactionEncoder:
dataset = pd.read_csv('retail.dat', header=None)
from mlxtend.preprocessing import TransactionEncoder
transactionEncoder = TransactionEncoder()
dataset = transactionEncoder.fit(dataset).transform(dataset)
dataset.astype('int')
print(dataset)
But I got the error:
TypeError: 'int' object is not iterable
I also want to attach the column names 0 1 2 ... to the newly formed dataset, but print(transactionEncoder.columns_) does not give valid columns. Please explain what the problem might be and what the correct way to apply TransactionEncoder to this dataset is.
Answer 0 (score: 2)
IIUC, you can stack the dataframe and try crosstab:
import pandas as pd

df = pd.read_csv('retail.dat', sep=' ', header=None)        # ragged rows are padded with NaN
new_df = df.stack().astype(int).reset_index(name='value')   # long format: one (transaction, item) row per purchase
pd.crosstab(new_df['level_0'], new_df['value'])              # 0/1 matrix: transactions x items
Output:
value 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56 57 58 ...
level_0 ...
0 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
7 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0
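(If the goal is to feed this straight into mlxtend's Apriori, here is a rough sketch, assuming each item appears at most once per basket so the crosstab only contains 0s and 1s; the min_support value is just a placeholder:)
from mlxtend.frequent_patterns import apriori

basket = pd.crosstab(new_df['level_0'], new_df['value']).astype(bool)   # boolean one-hot basket matrix
frequent_itemsets = apriori(basket, min_support=0.2, use_colnames=True)  # placeholder support threshold
print(frequent_itemsets)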
Answer 1 (score: 1)
You can try the following approach:
import pandas as pd
import numpy as np
from io import StringIO
from mlxtend.preprocessing import TransactionEncoder

inputstr = StringIO("""0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32
33 34 35
36 37 38 39 40 41 42 43 44 45 46
38 39 47 48
38 39 48 49 50 51 52 53 54 55 56 57 58
32 41 59 60 61 62
3 39 48 """)

df = pd.read_csv(inputstr, header=None, sep=r'\s+')

# TransactionEncoder expects a list of transactions (a list of lists),
# so drop the NaN padding and convert each row to a plain Python list
df_out = df.apply(lambda x: list(x.dropna().values), axis=1).tolist()

transactionEncoder = TransactionEncoder()
dataset = transactionEncoder.fit(df_out).transform(df_out)
dataset = dataset.astype('int')
print(dataset)
Output:
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
And converted to a dataframe:
dataset_df = pd.DataFrame(dataset)
Output:
0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56 57 58 59
0 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1
7 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
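(To attach the item IDs as column names, as asked in the question, one option is to pass transactionEncoder.columns_ when building the dataframe. With the list-of-lists input used above, columns_ should now hold the item IDs, though they may come back as floats because of the NaN padding from read_csv. A small sketch:)
dataset_df = pd.DataFrame(dataset, columns=transactionEncoder.columns_)  # item IDs as column labels
print(dataset_df.columns)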