Question

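I have a simple task, but I can't solve my problem.

I have a huge data frame and want to run a KNN on it, but I can't because I get the following error:

Error: factor predictors must have at most 32 levels

So far so good. My idea is to aggregate the column so that I end up with fewer factor levels.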

str(only_savings_medium$MaterialGroupCode)

Factor w/ 40 levels "1A","1B","1C",..: 11 11 11 15 15 15 15 15 15 15 ...

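There are 40 levels of "codes" of the form "1A", "1B", ..., "2B", "2D", ..., "3A", ..., "3D", "4B", "4C", ..., "5A", ..., "5Z". Basically I want to check whether a code starts with 1, 2, 3, 4 or 5 and assign that digit to a new column. All codes with a 1 (any letter) should be assigned to 1, all codes with a 2 (any letter) to 2, and so on. In the end there should be a new column with only 5 levels, each of which covers all of the finer levels beneath it. I'm not sure how best to explain it; I hope you understand my problem.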

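EDIT: I will try to expand my explanation. Here is a part of the data frame:

As you can see, there is a column with different material group codes. It has 40 levels. What I need is a new column for this data frame that contains 5 levels (1, 2, 3, 4 or 5). Taking my screenshot as an example, the new column would have the following values: 2,2,2,2,2,1,1,1,1,1,1,3,3,3,3,3, ..., 3. Basically every 1A - 1Z is assigned to level 1 of the new column, every 2A - 2Z to level 2, and so on.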

Answer 1

Like this?

Extracting the first position of each element of the vector (in your case, the digit) works whether or not the vector is a column in a data.frame. The result is of type character. If you need a factor, see as.factor().
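The code for this answer did not survive, so here is a minimal sketch of what it describes, assuming the first character is taken with substr(). The codes vector is a made-up stand-in for only_savings_medium$MaterialGroupCode.

```r
# Hypothetical sample of codes (assumption: not the asker's real data)
codes <- factor(c("1A", "1B", "2B", "2D", "3A", "4C", "5Z", "1A"))

# Take the first character of each code; the result is a character vector
first_char <- substr(as.character(codes), 1, 1)

# Convert to a factor if a factor is needed downstream
group <- as.factor(first_char)

nlevels(group)
```

On the real data the last step would be something like only_savings_medium$Group <- group, where the column name Group is only an illustration.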

Answer 2

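Basically you want to reduce the number of levels. Here are some guidelines (since you did not provide a reproducible example):

Create a correspondence data.frame that maps each of the 40 levels of the original factor to a level of a new factor with fewer levels.
Use merge to join your data with this correspondence data.frame.

Here is an example: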

## the long factor (40 levels in your case)
origin_factors <- LETTERS[1:15]
## the target levels
dest_factors <- c("l1", "l2", "l3")
## the correspondence table: each original level mapped to its new level
corrs <- data.frame(
  x  = origin_factors,
  nx = rep(dest_factors, each = 5)
)
## create a reproducible example
set.seed(1)
ex <- sample(origin_factors, 100, replace = TRUE)
dat <- data.frame(x = ex)
## merge to reduce the number of levels (note: merge() sorts the rows by x)
merge(dat, corrs)
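Applied to codes like the ones in the question, the same idea might look like the sketch below. The data frame and the Group column name are assumptions made for illustration. Because merge() reorders rows by the key, the sketch also shows a match() lookup, which keeps the original row order.

```r
# Hypothetical data standing in for the asker's data frame
dat <- data.frame(
  MaterialGroupCode = factor(c("2B", "2D", "1A", "1B", "3A", "3D", "5Z"))
)

# Correspondence table: one row per original level, mapped to its leading digit
corrs <- data.frame(
  MaterialGroupCode = levels(dat$MaterialGroupCode),
  Group = substr(levels(dat$MaterialGroupCode), 1, 1)
)

# Option 1: merge() adds the Group column but sorts rows by the key
merged <- merge(dat, corrs, by = "MaterialGroupCode")

# Option 2: look up each row with match(), preserving the original order
idx <- match(dat$MaterialGroupCode, corrs$MaterialGroupCode)
dat$Group <- factor(corrs$Group[idx])
```

The lookup table is worth the extra step when the grouping cannot be derived from the code itself, for example when the mapping comes from a separate file.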

Answer 3

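OK, I was finally able to solve my problem. Since I am a beginner, the code you gave me was too complicated for me. This is what I did:

I copied the whole "MaterialGroupCode" column and bound it to the same DF under a different name. So basically I have the same DF plus a copy of the "MaterialGroupCode" column, named "MDC".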

my_df$MDC <- substring(my_df$MDC, 1, 1)

So I took a substring, because I only needed to drop the letter. The result is a character vector, so the only thing left to do was:

my_df$MDC <- as.factor(my_df$MDC)

Now I have a new MDC column, which is a factor with 5 levels: 1A ... 1Z map to 1, 2B ... 2Z map to 2, and so on.
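For what it's worth, the same result can probably be reached without going through a character vector, by renaming the levels directly: when several levels of a factor are given the same new name, R merges them. A minimal sketch, with my_df filled with made-up values:

```r
# Hypothetical stand-in for my_df (assumption: not the real data)
my_df <- data.frame(
  MaterialGroupCode = factor(c("2B", "2D", "1A", "1B", "3A", "5Z"))
)

# Copy the column, then rename its levels to their first character;
# levels that receive the same name are merged into one
my_df$MDC <- my_df$MaterialGroupCode
levels(my_df$MDC) <- substring(levels(my_df$MDC), 1, 1)

table(my_df$MDC)
```

This only touches the levels (40 strings in the original data) rather than every row, and the original MaterialGroupCode column stays unchanged.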

Aggregate factors in a new column in R
