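I have a dataset in which one column has 142 unique values. As part of building a predictive model, I want to create dummy variables for that column. However, rather than creating 142 dummy variables, I first want to group together the values for which the response variable behaves similarly. The code I am using is shown below: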
round(tapply(train_data$Price,train_data$Suburb,mean),0)
This gives me an array of 142 distinct elements, and finding similar values by hand is time-consuming. A snippet of my output is pasted below:
round(tapply(train_data$Price,train_data$Suburb,mean),0)
Abbotsford        Aberfeldie        Airport West
1057934           1235150           707542
Albert Park       Albion            Alphington
1919014           547711            1188880
Altona            Altona North      Armadale
757866            728127            1542430
Ascot Vale        Ashburton         Ashwood
968702            1595275           1049184
Avondale Heights  Balaclava         Balwyn
792321            675133            1912896
Balwyn North      Bellfield         Bentleigh
1769984           798778            1282869
Bentleigh East    Box Hill          Braybrook
1038886           1138650           646845
Brighton          Brighton East     Brooklyn
1864928           1607299           542182
Brunswick         Brunswick East    Brunswick West
952350            874927            744986
Bulleen           Burnley           Burwood
1142944           1150902           1167023
Camberwell        Campbellfield     Canterbury
1761263           447600            2284188
Carlton           Carlton North     Carnegie
1062721           1436615           915587
Caulfield         Caulfield East    Caulfield North
981417            1099000           1055575
Caulfield South   Chadstone         Clifton Hill
1119571           1007909           1049742
Coburg            Coburg North      Collingwood
851215            770902            858415
Cremorne          Docklands         Doncaster
943731            937500            1210059
Eaglemont         East Melbourne    Elsternwick
How can I write code that groups all the values by a condition, for example a mean between 600000 and 699999, between 700000 and 799999, and so on?
Answer 0: (score: 0)
I ended up with code that does exactly what I need:
# Keep only the suburbs whose mean price falls in [600000, 700000)
subset(aggregate(Price ~ Suburb, train_data,
                 function(x) as.integer(mean(x) >= 600000 & mean(x) < 700000)),
       Price == 1)
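The snippet above picks out one price band at a time. A possible way to assign every suburb to a band in a single pass is base R's cut(). The following is only a minimal sketch: the toy train_data values, the 100000 band width, and the names suburb_mean, suburb_band and PriceBand are assumptions made for illustration, not part of the original post.

```r
# Toy stand-in for train_data (hypothetical values, for illustration only)
train_data <- data.frame(
  Suburb = c("A", "A", "B", "B", "C", "C", "D"),
  Price  = c(650000, 670000, 720000, 760000, 610000, 690000, 1250000)
)

# Mean price per suburb (same statistic as the tapply() call in the question)
suburb_mean <- tapply(train_data$Price, train_data$Suburb, mean)

# Break points every 100000, wide enough to cover all suburb means
width  <- 100000
breaks <- seq(floor(min(suburb_mean) / width) * width,
              (floor(max(suburb_mean) / width) + 1) * width,
              by = width)

# right = FALSE makes each band closed on the left, e.g. [600000, 700000)
suburb_band <- cut(suburb_mean, breaks = breaks, right = FALSE, dig.lab = 10)

# Attach the band of its suburb to every row of the data
train_data$PriceBand <- suburb_band[match(train_data$Suburb, names(suburb_mean))]

# List which suburbs ended up in each non-empty band
split(names(suburb_mean), droplevels(suburb_band))
```

PriceBand is a factor, so it can stand in for Suburb when building the dummy variables, which gives one dummy per band instead of one per suburb. Calling droplevels() on it first avoids creating dummies for empty bands.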