Like many others here, I am new to R. I have a large data set (500M+ rows) that I have loaded into a data.table, logStats, which contains data like this:
head(logStats,15)
time pid mean
1: 2014-03-10 00:00:00 998 3.570000
2: 2014-03-10 00:00:00 11 4.090000
3: 2014-03-10 00:00:00 345 3.380000
4: 2014-03-10 00:05:00 998 4.866667
5: 2014-03-10 00:05:00 11 3.677778
6: 2014-03-10 00:05:00 345 4.487500
7: 2014-03-10 00:10:00 345 4.833333
8: 2014-03-10 00:10:00 998 4.333333
9: 2014-03-10 00:10:00 11 6.977778
10: 2014-03-10 00:15:00 345 3.900000
11: 2014-03-10 00:15:00 998 3.200000
12: 2014-03-10 00:15:00 11 6.030000
13: 2014-03-10 00:20:00 998 4.550000
14: 2014-03-10 00:20:00 11 4.030000
15: 2014-03-10 00:20:00 345 6.060000
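I also have a second, very small data.table (360 rows) with two columns, used to decode a 'pid' value into something friendlier. The 'pid' values can be numeric or character.
For example: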
pidLookupTable<-data.table(pid=c(998,11,345),pidName=c("Apple","Bannana","Cinnamon"))
which produces:
pid pidName
1: 998 Apple
2: 11 Bannana
3: 345 Cinnamon
I would like an expression that adds a column to the data.table logStats containing the pidName that corresponds to each row's pid.
I should get something like:
time pid mean pidNames
1: 2014-03-10 00:00:00 998 3.570000 Apple
2: 2014-03-10 00:00:00 11 4.090000 Banana
3: 2014-03-10 00:00:00 345 3.380000 Cinnamon
4: 2014-03-10 00:05:00 998 4.866667 Apple
5: 2014-03-10 00:05:00 11 3.677778 Banana
6: 2014-03-10 00:05:00 345 4.487500 Cinnamon
7: 2014-03-10 00:10:00 345 4.833333 Cinnamon
8: 2014-03-10 00:10:00 998 4.333333 Apple
9: 2014-03-10 00:10:00 11 6.977778 Banana
10: 2014-03-10 00:15:00 345 3.900000 Cinnamon
11: 2014-03-10 00:15:00 998 3.200000 Apple
12: 2014-03-10 00:15:00 11 6.030000 Banana
13: 2014-03-10 00:20:00 998 4.550000 Apple
14: 2014-03-10 00:20:00 11 4.030000 Banana
15: 2014-03-10 00:20:00 345 6.060000 Cinnamon
I wrote a function:
pidNameLookup<-function(x) {
return(pidLookupTable[pidLookupTable$pid==x,pidName])
}
and then ran:
logStats[,pidName:=pidNameLookup(pid)]
but this only converts the first 3 values and gives NA for everything after them:
logStats[1:1000]
date time pid value timestamp mean pidName
1: 10-03-2014 00:00:12 998 5.5 2014-03-10 00:00:12 3.57 Apple
2: 10-03-2014 00:00:17 11 2.1 2014-03-10 00:00:17 4.09 Bannana
3: 10-03-2014 00:00:22 345 5.7 2014-03-10 00:00:22 3.38 Cinnamon
4: 10-03-2014 00:00:47 998 1.0 2014-03-10 00:00:47 3.57 NA
5: 10-03-2014 00:00:55 11 0.3 2014-03-10 00:00:55 4.09 NA
---
996: 10-03-2014 02:49:37 345 0.7 2014-03-10 02:49:37 5.30 NA
997: 10-03-2014 02:50:01 998 9.9 2014-03-10 02:50:01 5.30 NA
998: 10-03-2014 02:50:08 11 7.0 2014-03-10 02:50:08 7.00 NA
999: 10-03-2014 02:50:18 345 2.4 2014-03-10 02:50:18 2.40 NA
1000: 10-03-2014 02:50:48 998 0.7 2014-03-10 02:50:48 5.30 NA
and gives me a warning message:
Warning message:
In pidLookupTable$pid == x
longer object length is not a multiple of shorter object length
The warning message and the wrong results mean I am doing something wrong.
Help!! This is driving me mental.
Answer 0 (score: 7)
I suggest you look at the introductory vignette for data.table (vignette("datatable-intro")), since this is exactly the kind of task data.table was built for.
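As for why the attempt in the question misbehaves: inside `j`, `pid` is the whole column, so `pidLookupTable$pid == x` compares the three lookup values against the full vector position by position and recycles them, rather than looking each pid up. Hence the "longer object length is not a multiple of shorter object length" warning. A small sketch of the difference, using a short made-up vector `x` in place of the real column (the names `cmp` and `looked_up` are just for illustration):

```r
library(data.table)

pidLookupTable <- data.table(pid = c(998, 11, 345),
                             pidName = c("Apple", "Bannana", "Cinnamon"))

# x stands in for the whole pid column that := hands to the function
# (illustrative values, not taken from the real logStats)
x <- c(998, 11, 345, 11, 998)

# == works element by element and recycles the 3 lookup pids against the
# 5 values in x; it warns because 5 is not a multiple of 3
cmp <- suppressWarnings(pidLookupTable$pid == x)
# cmp is TRUE TRUE TRUE FALSE FALSE: a positional comparison, not a lookup

# match() finds, for every element of x, its position in the lookup table
looked_up <- pidLookupTable$pidName[match(x, pidLookupTable$pid)]
```

`match()` would work as a quick fix, but a join is the more idiomatic data.table route.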
This will give you what you want, and should be much, much faster:
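# Key both tables on pid (this sorts each table by pid, by reference)
setkey(logStats, "pid")
setkey(pidLookupTable, "pid")
# Keyed join: returns a new table with the matching pidName attached to each row
logStats[pidLookupTable]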
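If you would rather add the column to logStats in place and leave its row order alone, an update join is another option. This is a minimal sketch, assuming a data.table version that supports the `on=` argument (1.9.6 or later, as far as I know) and using a tiny made-up logStats purely for illustration:

```r
library(data.table)

# Toy stand-ins for the real tables (values copied from the question's sample)
logStats <- data.table(
  time = as.POSIXct(c("2014-03-10 00:00:00", "2014-03-10 00:00:00",
                      "2014-03-10 00:00:00", "2014-03-10 00:05:00"), tz = "UTC"),
  pid  = c(998, 11, 345, 998),
  mean = c(3.57, 4.09, 3.38, 4.866667)
)
pidLookupTable <- data.table(pid = c(998, 11, 345),
                             pidName = c("Apple", "Bannana", "Cinnamon"))

# For every logStats row whose pid matches a row of pidLookupTable, assign
# that row's pidName by reference; unmatched pids are left as NA
logStats[pidLookupTable, pidName := i.pidName, on = "pid"]
```

Both pid columns need to be the same type for the join to work, so if one side is character and the other numeric, convert one of them first.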