How do I create a supervised dataset?

Asked: 2018-03-02 09:10:35

Tags: python-3.x machine-learning data-science sklearn-pandas

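I want to create a dataset with 300 features in which every instance is a combination of 0s and 1s (boolean values). I have to specify which columns are 1 using ids. How can I do this in Python? For example, for one instance, columns 4, 45, 213, 6, and 48 should be 1, so the instance is the combination of those ids.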

1 answer:

Answer 0 (score: 0)

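I hope it's not too late and that I understood your question. You are asking for two main things:
    1. Generate a two-dimensional boolean sample set with 300 features (n rows by 300 columns)
    2. Generate a dependent variable that lists, for each observation (row), the features that are set to 1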

Here is my approach:

#%% Imports
# Data manipulation
import numpy as np
import pandas as pd

#%% List columns
def list_true_columns(row):
    """Return the positions of the features that equal 1 in one row."""
    return [i for i, value in enumerate(row) if value == 1]

column_amount = 300
row_amount = 1000

#%% Sample dataset
# Draw every cell independently as 0 or 1 with probability 0.5
dataset = pd.DataFrame(np.random.binomial(n=1, p=0.5, size=(row_amount, column_amount)))
# Based on the sample, calculate the dependent variable
dataset['dependent'] = dataset.apply(list_true_columns, axis=1)
# Print the frame; pandas truncates long output to the first and last rows
print(dataset)

Here is the sample (only the first and last rows are displayed):

    0   1   2   3   4   5   6   7   8   9   ... 291 292 293 294 295 296 297 298 299
0   0   1   1   0   1   1   1   0   1   0   ... 1   1   0   0   0   0   0   1   1
1   1   1   0   0   0   1   0   1   1   0   ... 0   1   1   1   0   1   1   0   1
2   0   1   0   0   1   1   0   1   0   0   ... 0   1   0   1   0   0   1   1   0
3   0   1   0   1   0   0   1   1   1   0   ... 0   0   0   0   0   1   1   0   0
4   1   0   1   1   0   0   0   0   1   0   ... 1   1   1   0   0   0   1   0   1
5   0   0   1   1   1   1   0   1   0   0   ... 1   1   0   1   0   1   1   1   0
..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ... ... ... ... ... ... ... ... ... ...
994 1   1   0   1   1   0   1   1   0   1   ... 0   0   0   1   0   0   1   0   0
995 1   0   1   0   0   0   0   1   0   0   ... 1   1   0   0   0   0   1   0   1
996 1   0   1   0   1   0   0   0   0   1   ... 1   1   0   0   0   1   1   0   1
997 0   0   0   1   0   1   1   0   0   0   ... 1   0   1   1   0   0   0   1   0
998 0   0   0   0   0   1   1   1   1   0   ... 1   0   0   0   1   1   1   1   0
999 0   0   1   0   0   0   1   1   1   1   ... 1   0   0   1   1   1   1   1   1

Here is the dependent variable, the list of column ids that equal 1 in each row:

                                            dependent  
0    [1, 2, 4, 5, 6, 8, 11, 15, 17, 18, 19, 20, 21,...  
1    [0, 1, 5, 7, 8, 12, 15, 16, 17, 18, 19, 20, 24...  
2    [1, 4, 5, 7, 11, 12, 15, 16, 18, 26, 27, 28, 2...  
3    [1, 3, 6, 7, 8, 11, 12, 15, 16, 23, 25, 27, 28...  
4    [0, 2, 3, 8, 13, 16, 18, 19, 20, 21, 22, 28, 2...  
5    [2, 3, 4, 5, 7, 10, 11, 12, 13, 14, 15, 21, 24...  
..                                                 ...   
994  [0, 1, 3, 4, 6, 7, 9, 10, 11, 15, 17, 20, 21, ...  
995  [0, 2, 7, 12, 13, 14, 15, 16, 17, 19, 22, 23, ...  
996  [0, 2, 4, 9, 11, 13, 16, 17, 18, 20, 21, 23, 2...  
997  [3, 5, 6, 11, 14, 20, 21, 22, 24, 28, 30, 35, ...  
998  [5, 6, 7, 8, 13, 17, 19, 20, 22, 23, 24, 28, 3...  
999  [2, 6, 7, 8, 9, 14, 17, 18, 19, 20, 21, 22, 23...
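
The question also describes the opposite direction: starting from a list of ids and producing a row with 1 in those columns. A minimal sketch of that, assuming the ids are zero-based column positions; the helper name `rows_from_ids` and the second and third example instances are made up for illustration and are not part of the answer above:

```python
import numpy as np
import pandas as pd

N_FEATURES = 300  # number of boolean feature columns, as in the question


def rows_from_ids(id_lists, n_features=N_FEATURES):
    """Build a 0/1 DataFrame with one row per list of column ids.

    Hypothetical helper: ids are assumed to be zero-based column positions.
    """
    data = np.zeros((len(id_lists), n_features), dtype=np.uint8)
    for row, ids in enumerate(id_lists):
        # Integer-array indexing sets every listed column of this row to 1
        data[row, np.asarray(ids, dtype=int)] = 1
    return pd.DataFrame(data)


# The first instance uses the ids from the question; the others are invented.
id_lists = [[4, 45, 213, 6, 48], [0, 299], [7]]
features = rows_from_ids(id_lists)

# Round trip: recover the sorted ids of the 1-columns for each row
recovered = [np.flatnonzero(row).tolist() for row in features.to_numpy()]
print(recovered[0])  # [4, 6, 45, 48, 213]
```

An id outside the range 0 to 299 raises an `IndexError` here, which is probably the behaviour you want for malformed input. If your ids are one-based, subtract 1 before indexing.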