Question

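While trying to reproduce the "digit recognition" example given on slide 28 of this lecture on CART, I could not figure out how to create a data set of 200 samples from the specified distribution.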

# columns to be used for specified distribution
Digit <- c(1,2,3,4,5,6,7,8,9,0)
X1 <- c(0,1,1,0,1,1,1,1,1,1)
X2 <- c(0,0,0,1,1,1,0,1,1,1)
X3 <- c(1,1,1,1,0,0,1,1,1,1)
X4 <- c(0,1,1,1,1,1,0,1,1,0)
X5 <- c(0,1,0,0,0,1,0,1,0,1)
X6 <- c(1,0,1,1,1,1,1,1,1,1)
X7 <- c(0,1,1,0,1,1,0,1,1,1)

# df is the specified distribution 
df <- cbind(Digit,X1,X2,X3,X4,X5,X6,X7)

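The 10 digits are displayed by different on/off combinations of seven horizontal and vertical lights. Each digit is represented by a 7-dimensional vector of zeros and ones.

The $i$-th sample is $x_{i}=(x_{i1},x_{i2},\ldots,x_{i7})$. If $x_{ij}=1$, the $j$-th light is on; if $x_{ij}=0$, the $j$-th light is off.

The lecture states that the data for this example are generated by a malfunctioning calculator. Each of the seven lights independently has probability 0.1 of being in the wrong state. The training set contains 200 samples drawn from the specified distribution.

Could you help me understand how to set up this simulated data? Thank you for your time.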

Answer 1

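I hate being that person who answers their own post, but I just found the same example used on page 15 of the "rpart" documentation. I'll go ahead and write up the answer, but unless I hear otherwise from the community I'll delete this question at the end of the day. Sorry for my oversight.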

# the data for this example is generated by a malfunctioning calculator 
set.seed(1953) # An auspicious year
n <- 200
y <- rep(0:9, length.out = n)  # true digit for each of the n samples
temp <- c(1,1,1,0,1,1,1,
          0,0,1,0,0,1,0,
          1,0,1,1,1,0,1,
          1,0,1,1,0,1,1,
          0,1,1,1,0,1,0,
          1,1,0,1,0,1,1,
          0,1,0,1,1,1,1,
          1,0,1,0,0,1,0,
          1,1,1,1,1,1,1,
          1,1,1,1,0,1,0)

# The true light pattern 0-9
lights <- matrix(temp, 10, 7, byrow = TRUE)
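# Noisy lights: draw 1 ("light is in the correct state") with probability 0.9,
# then keep the true value where the draw is 1 and flip it where the draw is 0
temp1 <- matrix(rbinom(n*7, 1, 0.9), n, 7)
temp1 <- ifelse(lights[y+1, ] == 1, temp1, 1-temp1)
# Random lights: 17 extra on/off columns that are unrelated to the digit
temp2 <- matrix(rbinom(n*17, 1, 0.5), n, 17)
# Final predictor matrix: 7 noisy digit lights followed by 17 pure-noise lights
x <- cbind(temp1, temp2)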
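
For readers who want to check that the simulation matches the stated error model, here is a minimal sketch of the same idea wrapped in a function. The helper name `simulate_digits` and its arguments are hypothetical (they do not come from the lecture or the rpart documentation), and the sketch leaves out the 17 pure-noise columns. It flips each light independently with probability `p_flip` and then compares the result with the true patterns.

```r
# Hypothetical helper, not part of rpart: returns a data frame holding the
# true digit and seven lights, each flipped independently with prob. p_flip.
simulate_digits <- function(patterns, digits, p_flip = 0.1) {
  truth <- patterns[digits + 1, , drop = FALSE]  # row 1 holds digit 0
  flips <- matrix(rbinom(length(truth), 1, p_flip),
                  nrow(truth), ncol(truth))      # 1 means "this light is wrong"
  noisy <- abs(truth - flips)                    # XOR for 0/1 values
  colnames(noisy) <- paste0("X", seq_len(ncol(noisy)))
  data.frame(digit = factor(digits), noisy)
}

# True light patterns for digits 0-9, copied from the `lights` matrix above
lights <- matrix(c(1,1,1,0,1,1,1,
                   0,0,1,0,0,1,0,
                   1,0,1,1,1,0,1,
                   1,0,1,1,0,1,1,
                   0,1,1,1,0,1,0,
                   1,1,0,1,0,1,1,
                   0,1,0,1,1,1,1,
                   1,0,1,0,0,1,0,
                   1,1,1,1,1,1,1,
                   1,1,1,1,0,1,0), 10, 7, byrow = TRUE)

set.seed(1953)
digits <- rep(0:9, length.out = 200)
sim <- simulate_digits(lights, digits)

# Share of lights that differ from the true pattern; should be close to 0.1
err_rate <- mean(as.matrix(sim[, -1]) != lights[digits + 1, ])
err_rate
```

Because the function consumes random numbers in a different order than the code above, it will not reproduce the same `x` matrix even with the same seed. Only the distribution is the same.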

Generating simulated data from a specified distribution
