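I have a very messy data frame, full of NAs and with far too many date columns, that I have to use for a machine learning / prediction problem. Many of these date-like columns carry similar but not identical information. I would like to find an algorithmic way to decide which of these columns to keep and which to drop.
My current approach is very naive. Here is a test data frame so I can show what I am doing: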
mydf = structure(list(last.activity.date = structure(c(17407, NA, NA,
17333, 17338, 17357, 17388, 17350, NA, 17322, 17406, NA, 17336,
NA, NA, NA, NA, NA, 17323, NA, NA, NA, NA, 17375, 17401, 17325,
17400, NA, NA, 17380), class = "Date"), last.contacted = structure(c(17406,
NA, NA, 17333, NA, 17357, 17388, 17350, NA, 17322, 17406, NA,
17336, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17354, 17401,
17325, 17382, NA, NA, 17380), class = "Date"), create.date = structure(c(17319,
17319, 17319, 17319, 17319, 17319, 17319, 17319, 17320, 17319,
17320, 17320, 17320, 17320, 17321, 17321, 17322, 17322, 17322,
17322, 17322, 17322, 17322, 17322, 17322, 17323, 17322, 17323,
17323, 17323), class = "Date"), became.a.subscriber.date = structure(c(17319,
17319, NA, 17346, NA, 17319, 17319, 17319, 17320, 17319, NA,
17320, 17346, 17320, 17321, 17321, 17322, 17322, 17346, 17322,
17322, 17322, 17322, 17322, NA, 17323, 17322, 17323, 17323, 17323
), class = "Date"), first.email.send.date = structure(c(17319,
17319, 17319, 17319, 17320, 17319, 17319, 17319, 17320, 17319,
17320, 17320, 17320, 17320, 17321, 17321, 17322, 17322, 17322,
17322, 17322, 17322, 17322, 17322, 17322, 17323, 17322, 17323,
17323, 17323), class = "Date"), last.email.click.date = structure(c(17320,
NA, 17357, 17338, 17323, 17345, 17358, 17350, NA, 17319, NA,
17345, 17336, NA, NA, NA, NA, NA, 17327, 17328, NA, NA, NA, 17359,
17379, 17325, 17323, NA, NA, 17379), class = "Date")), .Names = c("last.activity.date",
"last.contacted", "create.date", "became.a.subscriber.date",
"first.email.send.date", "last.email.click.date"), row.names = c(NA,
30L), class = "data.frame")
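My current approach is a nested for loop that takes two columns at a time, computes the share of values that are equal among the rows where both columns are non-NA, and stores the result in a similarity matrix. Here is the code: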
sim.df = matrix(data = 0, nrow = ncol(mydf), ncol = ncol(mydf))
for(i in 1:ncol(mydf)) {
  for(j in 1:ncol(mydf)) {
    # skip diagonal elements
    if(i == j) { sim.df[i,j] = 0; next }
    # grab two columns
    a = mydf[,i]
    b = mydf[,j]
    # compute similarity accuracy when both columns are not NA
    idxs = !is.na(a) & !is.na(b)
    sims = sum(a[idxs] == b[idxs]) / sum(idxs)
    # update output matrix
    sim.df[i,j] = sims
  }
}
Here is the resulting sim.df:
round(sim.df, 3)
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] 0.000 0.769 0.000 0.000 0.000 0.214
[2,] 0.769 0.000 0.000 0.000 0.000 0.250
[3,] 0.000 0.000 0.000 0.885 0.967 0.059
[4,] 0.000 0.000 0.885 0.000 0.885 0.071
[5,] 0.000 0.000 0.967 0.885 0.000 0.059
[6,] 0.214 0.250 0.059 0.071 0.059 0.000
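In this case, columns 1 and 2 are similar to each other, as are columns 3-5, while column 6 is fairly unique compared to the rest. Columns 1 and 2 are similar, but column 2 has a few NAs where column 1 has values and not the other way around, so I would probably prefer to keep column 1. Among columns 3-5, column 4 has NAs while 3 and 5 do not, so I would keep either 3 or 5 and be indifferent between the two. I would also keep column 6.
My actual data frame has ~40 date columns, and it is hard to make these keep/drop decisions by eyeballing a 40x40 similarity matrix. I am looking for a better way to approach this.
Thanks!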
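To make the by-eye rule above concrete, here is a rough sketch of how it might be automated. It is only one possible reading of that reasoning, not an established method: the helper name `pick_columns`, the `threshold` of 0.75, the choice of single-linkage clustering, and the rule of treating columns with no overlapping non-NA rows as dissimilar are all assumptions made for illustration. It uses only base R (`hclust`, `cutree`, `as.dist`) and needs at least two columns.

```r
# Hypothetical helper (not from any package): groups near-duplicate columns
# and keeps, from each group, the column with the fewest NAs.
pick_columns <- function(df, threshold = 0.75) {
  p <- ncol(df)
  # pairwise share of equal values among rows where both columns are non-NA
  sim <- matrix(1, nrow = p, ncol = p, dimnames = list(names(df), names(df)))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      if (i == j) next
      ok <- !is.na(df[[i]]) & !is.na(df[[j]])
      # assumption: columns that never overlap are treated as dissimilar
      sim[i, j] <- if (any(ok)) mean(df[[i]][ok] == df[[j]][ok]) else 0
    }
  }
  # single-linkage clustering on the distance 1 - similarity;
  # columns linked at similarity >= threshold end up in the same group
  tree <- hclust(as.dist(1 - sim), method = "single")
  groups <- cutree(tree, h = 1 - threshold)
  na_count <- colSums(is.na(df))
  # within each group keep the column with the fewest NAs (first one on ties)
  keep <- vapply(split(names(df), groups),
                 function(cols) cols[which.min(na_count[cols])],
                 character(1))
  unname(keep)
}

# Toy data, made up for this example: b duplicates a but has one more NA,
# c is unrelated to both.
toy <- data.frame(
  a = as.Date(c("2017-06-01", "2017-06-02", NA, "2017-06-04")),
  b = as.Date(c("2017-06-01", NA, NA, "2017-06-04")),
  c = as.Date(c("2017-07-10", "2017-07-11", "2017-07-12", "2017-07-13"))
)
kept <- pick_columns(toy)
print(kept)
```

The threshold is arbitrary. I picked 0.75 only so that a pair with similarity 0.769, like columns 1 and 2 above, would land in the same group; a stricter value would split them.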