使用bigmemory技术和并行计算找到最频繁的项目?

时间:2014-05-09 19:31:02

标签: r csv

如何在不使用回归的情况下找出哪些月份的延迟最频繁?以下csv是100MB文件的示例。我知道我应该使用bigmemory技术,但不知道如何处理这个问题。这里几个月存储为整数而不是因子。

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2006,1,11,3,743,745,1024,1018,US,343,N657AW,281,273,223,6,-2,ATL,PHX,1587,45,13,0,,0,0,0,0,0,0
2006,1,11,3,1053,1053,1313,1318,US,613,N834AW,260,265,214,-5,0,ATL,PHX,1587,27,19,0,,0,0,0,0,0,0
2006,1,11,3,1915,1915,2110,2133,US,617,N605AW,235,258,220,-23,0,ATL,PHX,1587,4,11,0,,0,0,0,0,0,0
2006,1,11,3,1753,1755,1925,1933,US,300,N312AW,152,158,126,-8,-2,AUS,PHX,872,16,10,0,,0,0,0,0,0,0
2006,1,11,3,824,832,1015,1015,US,765,N309AW,171,163,132,0,-8,AUS,PHX,872,27,12,0,,0,0,0,0,0,0
2006,1,11,3,627,630,834,832,US,295,N733UW,127,122,108,2,-3,BDL,CLT,644,6,13,0,,0,0,0,0,0,0
2006,1,11,3,825,820,1041,1021,US,349,N177UW,136,121,111,20,5,BDL,CLT,644,4,21,0,,0,0,0,20,0,0
2006,1,11,3,942,945,1155,1148,US,356,N404US,133,123,121,7,-3,BDL,CLT,644,4,8,0,,0,0,0,0,0,0
2006,1,11,3,1239,1245,1438,1445,US,775,N722UW,119,120,103,-7,-6,BDL,CLT,644,4,12,0,,0,0,0,0,0,0
2006,1,11,3,1642,1645,1841,1845,US,1002,N104UW,119,120,105,-4,-3,BDL,CLT,644,4,10,0,,0,0,0,0,0,0
2006,1,11,3,1836,1835,NA,2035,US,1103,N425US,NA,120,NA,NA,1,BDL,CLT,644,0,17,0,,1,0,0,0,0,0
2006,1,11,3,NA,1725,NA,1845,US,69,0,NA,80,NA,NA,NA,BDL,DCA,313,0,0,1,A,0,0,0,0,0,0

2 个答案:

答案 0 :(得分:2)

假设您的data.frame名为dd。如果您想查看所有年份中每个月的天气延迟总数,您可以这样做

delay <- aggregate(WeatherDelay~Month, dd, sum)
delay[order(-delay$WeatherDelay),]

答案 1 :(得分:1)

这更接近你想要的吗?我不太清楚R对行进行求和,但这至少会聚合它们。我也在学习!

delays <- read.csv("tmp.csv", stringsAsFactors = FALSE)

delay <- aggregate(cbind(ArrDelay, DepDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay) ~ Month, delays, sum)
delay

输出:

  Month ArrDelay DepDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
1     1       10      -16            0        0             0                 0
2     2      -31       -2            0        0             0                 0
3     3        9       -4            0       20             0                 0

注意:我更改了您的文档,以便在Months列中提供一些多样性:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2006,1,11,3,743,745,1024,1018,US,343,N657AW,281,273,223,6,-2,ATL,PHX,1587,45,13,0,,0,0,0,0,0,0
2006,1,11,3,1053,1053,1313,1318,US,613,N834AW,260,265,214,-5,0,ATL,PHX,1587,27,19,0,,0,0,0,0,0,0
2006,2,11,3,1915,1915,2110,2133,US,617,N605AW,235,258,220,-23,0,ATL,PHX,1587,4,11,0,,0,0,0,0,0,0
2006,2,11,3,1753,1755,1925,1933,US,300,N312AW,152,158,126,-8,-2,AUS,PHX,872,16,10,0,,0,0,0,0,0,0
2006,1,11,3,824,832,1015,1015,US,765,N309AW,171,163,132,0,-8,AUS,PHX,872,27,12,0,,0,0,0,0,0,0
2006,1,11,3,627,630,834,832,US,295,N733UW,127,122,108,2,-3,BDL,CLT,644,6,13,0,,0,0,0,0,0,0
2006,3,11,3,825,820,1041,1021,US,349,N177UW,136,121,111,20,5,BDL,CLT,644,4,21,0,,0,0,0,20,0,0
2006,1,11,3,942,945,1155,1148,US,356,N404US,133,123,121,7,-3,BDL,CLT,644,4,8,0,,0,0,0,0,0,0
2006,3,11,3,1239,1245,1438,1445,US,775,N722UW,119,120,103,-7,-6,BDL,CLT,644,4,12,0,,0,0,0,0,0,0
2006,3,11,3,1642,1645,1841,1845,US,1002,N104UW,119,120,105,-4,-3,BDL,CLT,644,4,10,0,,0,0,0,0,0,0
2006,3,11,3,1836,1835,NA,2035,US,1103,N425US,NA,120,NA,NA,1,BDL,CLT,644,0,17,0,,1,0,0,0,0,0
2006,1,11,3,NA,1725,NA,1845,US,69,0,NA,80,NA,NA,NA,BDL,DCA,313,0,0,1,A,0,0,0,0,0,0