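This is part of a feature engineering step that summarizes each ID based on a column named Col. The same preprocessing will be applied to the test set. Because the data set is large, a data.table-based solution is preferred.
Training input: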
ID Col
A M
A M
A M
B K
B M
Expected output for the training input above:
ID Col_M Col_K
A 3 0 # A has 3 M in Col and 0 K in Col
B 1 1
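The above covers the training data. For the test data set, the output must map onto the same Col_M and Col_K columns, which means that any other value appearing in Col, such as S, is ignored.
Test input: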
ID Col
C M
C S
Expected output for the test input above:
ID Col_M Col_K
C 1 0 # C has 1 M in Col and 0 K in Col. The S value is ignored
Answer 0 (score: 1)
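A possible data.table implementation first filters on c("M", "K"), then adds those levels to the factor (in case one of them is absent, as in the second example), and then runs dcast with drop = FALSE and fill = 0L (so that a required level that is missing still appears, filled with zero) and with fun = length (so that it counts rows). Note that fun is matched partially to dcast's fun.aggregate argument.
Testing on both data sets: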
library(data.table)
### First example
df <- fread("ID Col
A M
A M
A M
B K
B M")
dcast(df[Col %in% c("M", "K")], # Work only with c("M", "K")
      ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))), # Add missing levels
      drop = FALSE, # Keep missing levels in output
      fill = 0L, # Fill missing values with zeroes instead of NAs
      fun = length) # Count. You can also specify 'value.var'
# ID M K
# 1: A 3 0
# 2: B 1 1
### Second example
df <- fread("ID Col
C M
C S")
dcast(df[Col %in% c("M", "K")],
      ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))),
      drop = FALSE,
      fill = 0L,
      fun = length)
# ID M K
# 1: C 1 0
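Since the same call is repeated for the training and the test data, it may be convenient to wrap it in a function. The following is only a sketch: the helper name count_by_id and its keep argument are hypothetical and do not come from the answer above. It fixes the factor levels to keep up front instead of using union(), and renames the result to the Col_M / Col_K names requested in the question.

```r
library(data.table)

# Hypothetical helper (not part of the original answer): count, per ID, how
# often each value listed in `keep` occurs in Col; all other values are ignored.
count_by_id <- function(dt, keep = c("M", "K")) {
  # Keep only the wanted values and fix the factor levels, so that a level
  # that never occurs still produces a column.
  sub <- dt[Col %in% keep, .(ID, Col = factor(Col, levels = keep))]
  out <- dcast(sub, ID ~ Col,
               value.var = "Col",
               fun.aggregate = length, # count rows per ID/level
               drop = FALSE,           # keep levels with no rows
               fill = 0L)              # absent combinations become 0
  setnames(out, keep, paste0("Col_", keep))
  out[]
}

train <- data.table(ID = c("A", "A", "A", "B", "B"),
                    Col = c("M", "M", "M", "K", "M"))
test  <- data.table(ID = c("C", "C"), Col = c("M", "S"))

train_counts <- count_by_id(train)
test_counts  <- count_by_id(test)
```

One caveat of filtering first: an ID whose rows all carry ignored values (for example only S) is removed before the cast and therefore does not appear in the result.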
Answer 1 (score: 0)
I'm not sure how large your data is or how flexible the code needs to be, but here is what I have:
zz = '
ID Col
A M
A M
A M
B K
B M
'
df <- read.table(text = zz, header = TRUE)
col = as.data.frame(table(df))
out <- reshape(col, idvar = "ID",
timevar = "Col", direction = "wide")
out
This gives you:
> out
ID Freq.K Freq.M
1 A 0 3
2 B 1 1
For the second data frame:
yy = '
ID Col
C M
C S
'
df1 <- read.table(text = yy, header = TRUE)
col1 = as.data.frame(table(df1))
out1 <- reshape(col1, idvar = "ID",
timevar = "Col", direction = "wide")
out1
You get:
> out1
ID Freq.M Freq.S
1 C 1 1
Then merge them together and drop the extra columns:
ss = merge(out1, out, all.y = T, all.x = T)
ss
ID Freq.M Freq.S Freq.K
1 C 1 1 NA
2 A 3 NA 0
3 B 1 NA 1
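The answer stops before the clean-up step. A minimal base R sketch of it could look like the following, assuming that only M and K are wanted (as stated in the question) and that a missing count should be read as zero. The data frame is rebuilt by hand from the merged output shown above so that the snippet runs on its own; the variable wanted is introduced here for illustration only.

```r
# Rebuild the merged result shown above so the sketch is self-contained.
ss <- data.frame(ID     = c("C", "A", "B"),
                 Freq.M = c(1L, 3L, 1L),
                 Freq.S = c(1L, NA, NA),
                 Freq.K = c(NA, 0L, 1L))

wanted <- c("M", "K")                           # levels to keep (assumption from the question)
ss <- ss[, c("ID", paste0("Freq.", wanted))]    # drops Freq.S and any other extra level
ss[is.na(ss)] <- 0L                             # a missing count means zero occurrences
names(ss) <- sub("^Freq\\.", "Col_", names(ss)) # Freq.M -> Col_M, Freq.K -> Col_K
ss
```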
Answer 2 (score: 0)
> library(data.table)
> dt=NULL
> dt$ID=c("A","A","A","B","B")
> dt$Col=c("M","M","M","K","M")
> dt=data.frame(dt)
> dt=data.table(dt)
> dt
ID Col
1: A M
2: A M
3: A M
4: B K
5: B M
> a=dt[Col=="M",sum(.N),ID]
> b=dt[Col=="K",sum(.N),ID]
> a
ID V1
1: A 3
2: B 1
> b
ID V1
1: B 1
> setkey(a,ID)
> setkey(b,ID)
> m=b[a]
> m
ID V1 i.V1
1: A NA 3
2: B 1 1
> names(m)=c("ID","Col_K","Col_M")
> m
ID Col_K Col_M
1: A NA 3
2: B 1 1
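Note that the join above leaves NA for A in Col_K, whereas the question expects 0. One possible way to avoid both the join and the NA, offered here only as a sketch and not as part of the original answer, is to compute both counts in a single grouped pass:

```r
library(data.table)

dt <- data.table(ID  = c("A", "A", "A", "B", "B"),
                 Col = c("M", "M", "M", "K", "M"))

# Summing a logical vector counts its TRUE values, so an ID without any
# "K" row gets 0 rather than NA. Values other than M and K are simply
# never counted.
m <- dt[, .(Col_K = sum(Col == "K"),
            Col_M = sum(Col == "M")),
        by = ID]
m
```

The trade-off is that the levels are written out by hand, which is fine for two values but becomes tedious for many.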