data.table: look up a value and translate it

Asked: 2014-03-14 05:11:55

Tags: r data.table lookup

Like many others, I am new to R. I have a large dataset (500M+ rows) that I have loaded into a data.table called logStats, which contains data like this:

 head(logStats,15)

                   time   pid   mean
 1: 2014-03-10 00:00:00   998 3.570000
 2: 2014-03-10 00:00:00   11 4.090000
 3: 2014-03-10 00:00:00   345 3.380000
 4: 2014-03-10 00:05:00   998 4.866667
 5: 2014-03-10 00:05:00   11 3.677778
 6: 2014-03-10 00:05:00   345 4.487500
 7: 2014-03-10 00:10:00   345 4.833333
 8: 2014-03-10 00:10:00   998 4.333333
 9: 2014-03-10 00:10:00   11 6.977778
10: 2014-03-10 00:15:00   345 3.900000
11: 2014-03-10 00:15:00   998 3.200000
12: 2014-03-10 00:15:00   11 6.030000
13: 2014-03-10 00:20:00   998 4.550000
14: 2014-03-10 00:20:00   11 4.030000
15: 2014-03-10 00:20:00   345 6.060000 

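There is also a second, very small data.table (360 rows) with two columns, used to decode a 'pid' value into something friendlier. The 'pid' values can be numeric or character.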

For example:

pidLookupTable<-data.table(pid=c(998,11,345),pidName=c("Apple","Bannana","Cinnamon"))

which produces:

   pid  pidName
1: 998    Apple
2:  11  Bannana
3: 345 Cinnamon

I would like an expression that adds a column to the data.table logStats containing the pidName looked up for each row's pid.

I should get something like:

                   time pid     mean pidNames
 1: 2014-03-10 00:00:00   998 3.570000 Apple
 2: 2014-03-10 00:00:00   11 4.090000 Banana
 3: 2014-03-10 00:00:00   345 3.380000 Cinnamon
 4: 2014-03-10 00:05:00   998 4.866667 Apple
 5: 2014-03-10 00:05:00   11 3.677778 Banana
 6: 2014-03-10 00:05:00   345 4.487500 Cinnamon
 7: 2014-03-10 00:10:00   345 4.833333 Cinnamon
 8: 2014-03-10 00:10:00   998 4.333333 Apple
 9: 2014-03-10 00:10:00   11 6.977778 Banana
10: 2014-03-10 00:15:00   345 3.900000 Cinnamon
11: 2014-03-10 00:15:00   998 3.200000 Apple
12: 2014-03-10 00:15:00   11 6.030000 Banana
13: 2014-03-10 00:20:00   998 4.550000 Apple
14: 2014-03-10 00:20:00   11 4.030000 Banana
15: 2014-03-10 00:20:00   345 6.060000  Cinnamon

I wrote a function:

pidNameLookup<-function(x) { 
  return(pidLookupTable[pidLookupTable$pid==x,name]) 
}

and then ran:

logStats[,pidName:=pidNameLookup(pid)]

But this only translates the first 3 values; all the following ones are NA:

   logStats[1:1000]
               date     time pid value           timestamp mean  pidName
      1: 10-03-2014 00:00:12 998   5.5 2014-03-10 00:00:12 3.57    Apple
      2: 10-03-2014 00:00:17  11   2.1 2014-03-10 00:00:17 4.09  Bannana
      3: 10-03-2014 00:00:22 345   5.7 2014-03-10 00:00:22 3.38 Cinnamon
      4: 10-03-2014 00:00:47 998   1.0 2014-03-10 00:00:47 3.57       NA
      5: 10-03-2014 00:00:55  11   0.3 2014-03-10 00:00:55 4.09       NA
      ---                                                                
      996: 10-03-2014 02:49:37 345   0.7 2014-03-10 02:49:37 5.30       NA
      997: 10-03-2014 02:50:01 998   9.9 2014-03-10 02:50:01 5.30       NA
      998: 10-03-2014 02:50:08  11   7.0 2014-03-10 02:50:08 7.00       NA
      999: 10-03-2014 02:50:18 345   2.4 2014-03-10 02:50:18 2.40       NA
     1000: 10-03-2014 02:50:48 998   0.7 2014-03-10 02:50:48 5.30       NA 

and it gives me a warning message:

Warning message:
In pidLookupTable$pid == x 
  longer object length is not a multiple of shorter object length

The warning message and the wrong results tell me I am doing something wrong.

Help!! This is driving me mental.

1 Answer:

Answer 0: (score: 7)

I suggest you look at the data.table introduction vignette (vignette("datatable-intro")), since this kind of lookup is exactly what data.table was built for.

This will give you what you want, and it should be much, much faster:

setkey(logStats, "pid")
setkey(pidLookupTable, "pid")
logStats[pidLookupTable]
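Note that the keyed join above returns a new table, and that setkey re-sorts logStats by pid. If the goal is to add the column to logStats in place, as the question asks, an update join is one option. The following is a minimal sketch rather than a definitive solution: it assumes a data.table version that supports the on= argument (1.9.6 or later), uses a six-row stand-in for the real logStats, and assumes the pid columns in both tables have the same type (both numeric here).

```r
library(data.table)

# Small stand-in for the real 500M-row table (illustrative values only)
logStats <- data.table(
  time = as.POSIXct(rep(c("2014-03-10 00:00:00", "2014-03-10 00:05:00"), each = 3),
                    tz = "UTC"),
  pid  = c(998, 11, 345, 998, 11, 345),
  mean = c(3.57, 4.09, 3.38, 4.866667, 3.677778, 4.4875)
)

pidLookupTable <- data.table(
  pid     = c(998, 11, 345),
  pidName = c("Apple", "Bannana", "Cinnamon")
)

# Update join: match rows of logStats to pidLookupTable on pid and
# assign the matching pidName by reference. The i. prefix refers to
# columns of the table passed in the first argument (pidLookupTable).
logStats[pidLookupTable, on = "pid", pidName := i.pidName]

print(logStats)
```

This keeps the row order of logStats unchanged and does not copy the table. Any pid without a match in pidLookupTable is left as NA in the new column. The original function failed because == compares the two vectors element by element, recycling the 3-element lookup column against the full pid column, instead of looking up each pid separately.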