From R

Time: 2017-09-05 23:16:19

Tags: r data-manipulation

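I have a very messy data frame, full of NAs and with far too many date columns, that I have to use for a machine learning / prediction problem. Many of the date-like columns carry similar but not identical information. I would like an algorithmic way to:

  • identify columns that are very similar
  • once a similar pair has been identified, choose which of the two columns to drop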

I am currently using a very naive approach. Here is a test data frame so that I can show what I am doing:

mydf = structure(list(last.activity.date = structure(c(17407, NA, NA, 
17333, 17338, 17357, 17388, 17350, NA, 17322, 17406, NA, 17336, 
NA, NA, NA, NA, NA, 17323, NA, NA, NA, NA, 17375, 17401, 17325, 
17400, NA, NA, 17380), class = "Date"), last.contacted = structure(c(17406, 
NA, NA, 17333, NA, 17357, 17388, 17350, NA, 17322, 17406, NA, 
17336, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17354, 17401, 
17325, 17382, NA, NA, 17380), class = "Date"), create.date = structure(c(17319, 
17319, 17319, 17319, 17319, 17319, 17319, 17319, 17320, 17319, 
17320, 17320, 17320, 17320, 17321, 17321, 17322, 17322, 17322, 
17322, 17322, 17322, 17322, 17322, 17322, 17323, 17322, 17323, 
17323, 17323), class = "Date"), became.a.subscriber.date = structure(c(17319, 
17319, NA, 17346, NA, 17319, 17319, 17319, 17320, 17319, NA, 
17320, 17346, 17320, 17321, 17321, 17322, 17322, 17346, 17322, 
17322, 17322, 17322, 17322, NA, 17323, 17322, 17323, 17323, 17323
), class = "Date"), first.email.send.date = structure(c(17319, 
17319, 17319, 17319, 17320, 17319, 17319, 17319, 17320, 17319, 
17320, 17320, 17320, 17320, 17321, 17321, 17322, 17322, 17322, 
17322, 17322, 17322, 17322, 17322, 17322, 17323, 17322, 17323, 
17323, 17323), class = "Date"), last.email.click.date = structure(c(17320, 
NA, 17357, 17338, 17323, 17345, 17358, 17350, NA, 17319, NA, 
17345, 17336, NA, NA, NA, NA, NA, 17327, 17328, NA, NA, NA, 17359, 
17379, 17325, 17323, NA, NA, 17379), class = "Date")), .Names = c("last.activity.date", 
"last.contacted", "create.date", "became.a.subscriber.date", 
"first.email.send.date", "last.email.click.date"), row.names = c(NA, 
30L), class = "data.frame")

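My current approach is a nested for loop that grabs two columns at a time, computes the percentage of values that are identical among the rows where both columns are not NA, and saves the result to a similarity matrix. Here is the code: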

sim.df = matrix(data = 0, nrow = ncol(mydf), ncol = ncol(mydf))

for(i in 1:ncol(mydf)) {
  for(j in 1:ncol(mydf)) {

    # skip diagonal elements
    if(i == j) { sim.df[i,j] = 0; next }

    # grab two columns
    a = mydf[,i]
    b = mydf[,j]

    # compute similarity accuracy when both columns are not NA
    idxs = !is.na(a) & !is.na(b)
    sims = sum(a[idxs] == b[idxs]) / sum(idxs)

    # update output matrix    
    sim.df[i,j] = sims
  }
}
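For reference, the same pairwise computation can be wrapped in a small helper. This is only a sketch: `pair_similarity`, `similarity_matrix`, and the `toy` data frame are hypothetical names introduced here, and returning NA for a pair with no overlapping rows is an assumption (the loop above would produce NaN in that case).

```r
# Hypothetical helper: share of identical values among the rows where
# both columns are observed (not NA).
pair_similarity <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(NA_real_)  # assumption: no overlap means "undefined"
  mean(a[both] == b[both])
}

# Hypothetical helper: fill a symmetric matrix, visiting each pair once.
similarity_matrix <- function(df) {
  n <- ncol(df)
  sim <- matrix(NA_real_, n, n, dimnames = list(names(df), names(df)))
  if (n < 2) return(sim)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      sim[i, j] <- sim[j, i] <- pair_similarity(df[[i]], df[[j]])
    }
  }
  sim
}

# Made-up toy data, only to exercise the helpers.
toy <- data.frame(
  x = as.Date(c("2017-06-01", "2017-06-02", NA, "2017-06-04")),
  y = as.Date(c("2017-06-01", NA, "2017-06-03", "2017-06-05")),
  z = as.Date(c(NA, NA, "2017-06-03", NA))
)
toy_sim <- similarity_matrix(toy)
```

Here `x` and `y` overlap on two rows and agree on one of them, `y` and `z` agree on their single shared row, and `x` and `z` never overlap, so that entry stays NA. The diagonal is left as NA rather than 0.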

Here is my sim.df output:

round(sim.df, 3)
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] 0.000 0.769 0.000 0.000 0.000 0.214
[2,] 0.769 0.000 0.000 0.000 0.000 0.250
[3,] 0.000 0.000 0.000 0.885 0.967 0.059
[4,] 0.000 0.000 0.885 0.000 0.885 0.071
[5,] 0.000 0.000 0.967 0.885 0.000 0.059
[6,] 0.214 0.250 0.059 0.071 0.059 0.000

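In this case, columns 1 and 2 are similar to each other, columns 3-5 are similar to each other, and column 6 is quite unique compared with the rest. Columns 1 and 2 are similar, but column 2 has several NAs where column 1 has values and not the other way around, so I would probably prefer to keep column 1. For columns 3-5, column 4 has NAs while 3 and 5 do not, and I would be indifferent between keeping 3 or 5. I would also keep column 6.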
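The manual reasoning above (group similar columns, then keep the one with the fewest NAs) can be sketched in code. Everything in this block other than the matrix values is an assumption: the 0.75 cutoff, the use of single-linkage clustering from base R's `hclust`, and the tie-break of keeping the first column are illustrative choices, not something already settled in this question. The NA counts were tallied by hand from the test data frame.

```r
col_names <- c("last.activity.date", "last.contacted", "create.date",
               "became.a.subscriber.date", "first.email.send.date",
               "last.email.click.date")

# Similarity matrix copied from the rounded sim.df output.
sim <- matrix(c(
  0.000, 0.769, 0.000, 0.000, 0.000, 0.214,
  0.769, 0.000, 0.000, 0.000, 0.000, 0.250,
  0.000, 0.000, 0.000, 0.885, 0.967, 0.059,
  0.000, 0.000, 0.885, 0.000, 0.885, 0.071,
  0.000, 0.000, 0.967, 0.885, 0.000, 0.059,
  0.214, 0.250, 0.059, 0.071, 0.059, 0.000
), nrow = 6, byrow = TRUE, dimnames = list(col_names, col_names))

# Number of NAs per column, tallied from the test data frame.
na_count <- c(15, 17, 0, 4, 0, 13)
names(na_count) <- col_names

threshold <- 0.75  # assumed cutoff for "very similar"

# Single linkage on 1 - similarity, cut at 1 - threshold: two columns land in
# the same group when a chain of pairs with similarity >= threshold links them.
hc <- hclust(as.dist(1 - sim), method = "single")
groups <- cutree(hc, h = 1 - threshold)

# Within each group keep the column with the fewest NAs (first one wins ties).
keep <- unname(vapply(
  split(col_names, groups),
  function(cols) cols[which.min(na_count[cols])],
  character(1)
))
```

With these inputs the groups are {1, 2}, {3, 4, 5}, and {6}, and `keep` holds `last.activity.date`, `create.date`, and `last.email.click.date`, which matches the choice described above. Because single linkage chains pairs together, a stricter linkage such as `"complete"` may be preferable if chained groups turn out too loose.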

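My actual data frame has ~40 date columns, and it is hard to work out which columns to keep or drop just by reading a 40x40 similarity matrix. I am looking for a better way to approach this problem.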
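One way to make a large matrix easier to read is to flatten it into a ranked list of column pairs above some cutoff. The sketch below uses a made-up 4x4 matrix with hypothetical column names `a` to `d` and an assumed cutoff of 0.75; only base R functions are used.

```r
# Made-up symmetric similarity matrix, for illustration only.
nm <- c("a", "b", "c", "d")
sim <- matrix(c(
  0.0, 0.9, 0.1, 0.0,
  0.9, 0.0, 0.2, 0.0,
  0.1, 0.2, 0.0, 0.8,
  0.0, 0.0, 0.8, 0.0
), nrow = 4, byrow = TRUE, dimnames = list(nm, nm))

threshold <- 0.75  # assumed cutoff

# Look only at the upper triangle so each pair is listed once.
idx <- which(upper.tri(sim) & sim >= threshold, arr.ind = TRUE)

pairs <- data.frame(
  col_a = rownames(sim)[idx[, "row"]],
  col_b = colnames(sim)[idx[, "col"]],
  similarity = sim[idx],
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Most similar pairs first.
pairs <- pairs[order(-pairs$similarity), ]
```

The result has one row per candidate pair, so 40 columns produce a short table to review instead of 1,600 matrix cells.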

Thanks!

0 answers:

No answers