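How to vectorize this in R

Time: 2014-02-27 02:28:32

Tags: r

I have two tables of data on traffic flow. I am trying to (eventually) combine them into a plot of traffic along a linear progression by milepost. For example: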

mileposts <- structure(list(city = c("city1", "city2", "city3", "city4"), 
milepost = c(0L, 50L, 120L, 250L)), .Names = c("city", "milepost"
), class = "data.frame", row.names = c("1", "2", "3", "4"))

   city milepost
1 city1        0
2 city2       50
3 city3      120
4 city4      250


traffic <- structure(list(citypair = c("city1-city2", "city2-city4", "city1-city3", 
"city1-city4", "city3-city4"), traffic = c(610L, 23L, 139L, 88L, 
17L), origmp = c(0L, 50L, 0L, 0L, 120L), destmp = c(50L, 250L, 
120L, 250L, 250L)), .Names = c("citypair", "traffic", "origmp", 
"destmp"), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5"))

     citypair traffic origmp destmp
1 city1-city2     610      0     50
2 city2-city4      23     50    250
3 city1-city3     139      0    120
4 city1-city4      88      0    250
5 city3-city4      17    120    250

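What I want is to add a column "volume" to the "mileposts" table that lists all traffic starting at or passing through that city (the cities are in the order 1-2-3-4). For example, the volume for city3 would be the sum of the values from traffic[c(2,4,5),2].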

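How do I do this? I know it must be some sort of for loop. I tried a loop that adds values from traffic$traffic to mileposts$vol on the conditions traffic$origmp[i] >= mileposts$milepost and traffic$destmp[i] <= mileposts$milepost, but I get the error "the condition has length > 1 and only the first element will be used". However, if I wrap the whole thing in another loop over the [j] dimension of mileposts$milepost, the whole thing becomes very slow to run. Any suggestions on how to speed this up or code it efficiently?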

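More generally, I suppose I am asking how to perform conditional operations using data from two data frames in an efficient way (that is, without looping over every row of both data frames). Thanks!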
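One way to picture the "conditional operation across two data frames" described above is a logical matrix with one row per milepost and one column per route. The following is a minimal sketch, not taken from the answers, and it assumes a route counts toward a city when origmp <= milepost < destmp (the reading that reproduces the city3 example). The names on_route and volume are illustrative.

```r
mileposts <- data.frame(
  city     = c("city1", "city2", "city3", "city4"),
  milepost = c(0L, 50L, 120L, 250L),
  stringsAsFactors = FALSE
)
traffic <- data.frame(
  citypair = c("city1-city2", "city2-city4", "city1-city3",
               "city1-city4", "city3-city4"),
  traffic  = c(610L, 23L, 139L, 88L, 17L),
  origmp   = c(0L, 50L, 0L, 0L, 120L),
  destmp   = c(50L, 250L, 120L, 250L, 250L),
  stringsAsFactors = FALSE
)

# on_route[i, j] is TRUE when milepost i lies on route j
# (assumption: origin inclusive, destination exclusive)
on_route <- outer(mileposts$milepost, traffic$origmp, ">=") &
            outer(mileposts$milepost, traffic$destmp, "<")

# Each row of the matrix selects the routes to sum for that milepost
mileposts$volume <- as.vector(on_route %*% traffic$traffic)
print(mileposts)
```

Because outer() compares every milepost with every route in one call, there is no explicit loop and no if() on a vector-valued condition, which is what triggers the "condition has length > 1" message.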

3 Answers:

Answer 0 (score: 1)

This is a bit convoluted, but it works:

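cityorder <- c("city1","city2","city3","city4")
# position of each route's origin and destination within cityorder
through <- lapply(strsplit(traffic$citypair,"-"),match,cityorder)
# cities a route starts at or passes through (destination excluded)
through <- lapply(through,function(x) seq(x[1],x[2]-1))

# for each city, flag the routes that include it, then sum their traffic
citymatch <- sapply(mileposts$city, grep, cityorder)
sum.ids <- lapply(citymatch, function(x)  sapply(through, function(y) x %in% y) )
mileposts$traffic <- sapply(sum.ids, function(x) sum(traffic$traffic[x]) )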

#   city milepost traffic
#1 city1        0     837
#2 city2       50     250
#3 city3      120     128
#4 city4      250       0

The result checks out against the expected result, "the volume for city3 would be the sum of the values from traffic[c(2,4,5),2]":

sum(traffic[c(2, 4, 5),2])
#[1] 128
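The through sequences above encode the idea that a route's traffic enters at its origin and leaves at its destination. The same idea can be sketched as a running total: add traffic where it enters, subtract it where it leaves, and take a cumulative sum along the line. This is an alternative sketch rather than the answer's method. It assumes mileposts is sorted by increasing milepost and that every origmp and destmp value appears in mileposts$milepost. The names entering, leaving, and volume are illustrative.

```r
mileposts <- data.frame(
  city     = c("city1", "city2", "city3", "city4"),
  milepost = c(0L, 50L, 120L, 250L),
  stringsAsFactors = FALSE
)
traffic <- data.frame(
  citypair = c("city1-city2", "city2-city4", "city1-city3",
               "city1-city4", "city3-city4"),
  traffic  = c(610L, 23L, 139L, 88L, 17L),
  origmp   = c(0L, 50L, 0L, 0L, 120L),
  destmp   = c(50L, 250L, 120L, 250L, 250L),
  stringsAsFactors = FALSE
)

# Row index in mileposts of each route's origin and destination
o <- match(traffic$origmp, mileposts$milepost)
d <- match(traffic$destmp, mileposts$milepost)

n <- nrow(mileposts)
# Traffic joining and leaving the line at each milepost
entering <- sapply(seq_len(n), function(i) sum(traffic$traffic[o == i]))
leaving  <- sapply(seq_len(n), function(i) sum(traffic$traffic[d == i]))

# Running balance: what is on the line when departing each city
mileposts$volume <- cumsum(entering - leaving)
print(mileposts)
```

The work here grows with the number of routes plus the number of mileposts, rather than with their product, which may matter for larger tables.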

Answer 1 (score: 0)

With your two tables, mileposts and traffic, already in memory, I was able to get the result you want using the code below:

library(data.table)

# building index of which route traffic is to be associated with which city
uniquecities <- unique(mileposts$milepost)
uniqueCityCombns <- data.table(expand.grid(uniquecities,uniquecities,uniquecities))
setnames(uniqueCityCombns, c('origmp','destmp','milepost'))
# keep only mileposts that lie on a route: origin <= milepost < destination
uniqueCityCombns <- uniqueCityCombns[origmp <= milepost & milepost < destmp]

# calculating traffic passing through the city
uniqueCityCombnsTrf <- merge(uniqueCityCombns,traffic, by = c('origmp','destmp'))
uniqueCityCombnsTrf <- uniqueCityCombnsTrf [,list(traffic = sum(traffic)), by = 'milepost']
uniqueCityCombnsTrf <- merge(uniqueCityCombnsTrf , mileposts, by = 'milepost')

Output:

> uniqueCityCombnsTrf 
   milepost traffic  city
1:        0     837 city1
2:       50     250 city2
3:      120     128 city3
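Note that the output above has no row for city4, because no route starts at or passes through it and the inner merge drops it. The sketch below shows the same join-then-aggregate idea in base R, with a left join at the end so that cities with no traffic are kept and reported as 0. It is an illustration under the assumption that origmp <= milepost < destmp defines "on the route", and the names pairs, on_route, agg, and result are made up for the example.

```r
mileposts <- data.frame(
  city     = c("city1", "city2", "city3", "city4"),
  milepost = c(0L, 50L, 120L, 250L),
  stringsAsFactors = FALSE
)
traffic <- data.frame(
  citypair = c("city1-city2", "city2-city4", "city1-city3",
               "city1-city4", "city3-city4"),
  traffic  = c(610L, 23L, 139L, 88L, 17L),
  origmp   = c(0L, 50L, 0L, 0L, 120L),
  destmp   = c(50L, 250L, 120L, 250L, 250L),
  stringsAsFactors = FALSE
)

# Cartesian product of routes and mileposts (by = NULL means no join key)
pairs <- merge(traffic, mileposts, by = NULL)

# Keep only the (route, milepost) pairs where the milepost is on the route
on_route <- pairs[pairs$origmp <= pairs$milepost &
                  pairs$milepost < pairs$destmp, ]

# Total traffic per milepost
agg <- aggregate(traffic ~ milepost, data = on_route, FUN = sum)

# Left join so mileposts without any traffic survive, then fill with 0
result <- merge(mileposts, agg, by = "milepost", all.x = TRUE)
result$traffic[is.na(result$traffic)] <- 0L
print(result)
```

The Cartesian product has one row per route and milepost combination, so this form is easy to read but uses more memory than a matrix or cumulative-sum approach on large inputs.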

Answer 2 (score: 0)

traffic$start  <-  as.numeric(gsub("city|-city.+$", "", traffic$citypair) )
traffic$end    <-  as.numeric(gsub("city[[:digit:]]*|-city", "", traffic$citypair) )
sapply(mileposts$city, function(cit) {n=as.numeric(sub("city","",cit))
                    sum(traffic$traffic*( (n >= traffic$start) & n < traffic$end) )} )
#---------
city1 city2 city3 city4 
  837   250   128     0