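Extracting values from a transition matrix using columns of a data.frame in R

Time: 2013-02-04 19:27:25

Tags: r dataframe

I have a transition matrix giving the cost of moving from one state to another, for example: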

cost <- data.frame( a=c("aa","ab"),b=c("ba","bb"))

(pretend the string "aa" is the cost of moving from a to a)

And I have a data.frame holding the states:

transitions <- data.frame( from=c("a","a","b"), to=c("a","b","b") )

I would like to add a column to transitions holding the cost of each transition, so that it ends up as:

  from to cost
1    a  a   aa
2    a  b   ab
3    b  b   bb

I'm sure there is an R-ish way to do this. I ended up using a for loop:

n <- dim(data)[1]
v <- vector("numeric",n)
for( i in 1:n ) 
{ 
    z<-data[i,c(col1,col2),with=FALSE]
    za <- z[[col1]]
    zb <- z[[col2]]
    v[i] <- dist[za,zb]
}
data <- cbind(data,d=v)
names(data)[dim(data)[2]] <- colName
data

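But this feels really ugly and is very slow: it takes about 20 minutes on a data.frame with 2M rows (whereas an operation that computes distances between elements of the same table takes less than a second).

Is there a simple, fast, one- or two-line command that would get me the cost column above?

3 Answers:

Answer 0: (score: 3)

Update: accounting for known states

A data.table solution: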

require(utils)
require(data.table)

## Data generation
N <- 2e6
set.seed(1)
states <- c("a","b")
cost <- data.frame(a=c("aa","ab"),b=c("ba","bb"))
transitions <- data.frame(from=sample(states, N, replace=TRUE), 
                            to=sample(states, N, replace=TRUE))

## Expanded cost matrix construction
f <- expand.grid(states, states)
f <- f[order(f$Var1, f$Var2),]
f$cost <- unlist(cost)

## Prepare data.table
dt <- data.table(transitions)
setkey(dt, from, to)

## Routine itself  
dt[,cost:=as.character("")] # You don't need this line if cost is numeric
apply(f, 1, function(x) dt[J(x[1],x[2]),cost:=x[3]])

With 2M rows in transitions, this takes about 0.3 seconds to run.
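As a possible alternative to calling apply over the state pairs, the same lookup can be written as a single update join. This is a sketch and not part of the original answer: it assumes a data.table version that supports the `on=` argument, and the `lookup` table name is made up for illustration.

```r
library(data.table)

states <- c("a", "b")
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"),
                   stringsAsFactors = FALSE)

# Long-form lookup table (hypothetical name): one row per (from, to) pair.
# unlist(cost) walks the columns in order, so `from` varies slowest.
lookup <- data.table(
  from = rep(names(cost), each = nrow(cost)),
  to   = rep(states, times = ncol(cost)),
  cost = unlist(cost, use.names = FALSE)
)

transitions <- data.table(from = c("a", "a", "b"), to = c("a", "b", "b"))

# Update join: each row of transitions receives the cost of its (from, to) pair
transitions[lookup, cost := i.cost, on = c("from", "to")]
```

The join adds the column by reference and keeps the row order of transitions, so no setkey or re-sorting is needed.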

Answer 1: (score: 2)

Here is one way (at least it works for this example, and I believe it will work on larger data too; if not, please reply with an example):

# load both cost and transitions with stringsAsFactors = FALSE
# so that strings are NOT by default loaded as factors
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb"), stringsAsFactors = FALSE)
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b"), 
                                      stringsAsFactors = FALSE)

# convert cost to vector: it'll have names a1, a2, b1, b2. we'll exploit that.
cost.vec <- unlist(cost)
# convert "to" to a factor whose levels follow the state order of cost, then
# create an id column from "from" and as.integer(to);
# as.integer(to) gives the level index, i.e. the row number in cost
transitions$to <- factor(transitions$to, levels = names(cost))
transitions$id <- paste0(transitions$from, as.integer(transitions$to))

# now you'll have a1, a2 etc. here as well; look each id up by name in the vector
transitions$val <- unname(cost.vec[transitions$id])

#   from to id val
# 1    a  a a1  aa
# 2    a  b a2  ab
# 3    b  b b2  bb

You can of course drop the id column. If this fails in any case, let me know and I'll try to fix it.
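For comparison, a base-R variant avoids the factor-to-integer step by indexing a character matrix with a two-column matrix of names. This is only a sketch and not part of the original answer: `cost.mat` is a hypothetical name, and it assumes the rows of cost are in the same state order as its columns (rows being the "to" state and columns the "from" state, as in the example).

```r
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"),
                   stringsAsFactors = FALSE)
transitions <- data.frame(from = c("a", "a", "b"), to = c("a", "b", "b"),
                          stringsAsFactors = FALSE)

cost.mat <- as.matrix(cost)
# Assumption: row i of cost describes the same state as column i
rownames(cost.mat) <- colnames(cost.mat)

# Each row of the index matrix is one (row name, column name) pair,
# here (to, from), so the whole lookup is a single vectorised call
transitions$cost <- cost.mat[cbind(transitions$to, transitions$from)]
```

Because the lookup goes by name rather than by factor level, it does not matter which states happen to appear in transitions.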

Answer 2: (score: 2)

Starting from Arun's answer, I went with:

library(reshape)
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb") )
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b") )
row.names(cost) <- c("a","b") #Normally get this from the csv file
cost$from <- row.names(cost)
m <- melt(cost, id.vars=c("from"))
m$transition = paste(m$from,m$variable)
transitions$transition=paste(transitions$from,transitions$to)
merge(m, transitions, by.x="transition",by.y="transition")

It's a few more lines, but I'm a bit distrustful of relying on factor ordering for indexing. It also means that when they are data.tables, I can do:

setkey(m,transition)
setkey(transitions,transition)
m[transitions]

I haven't benchmarked it, but on large datasets I'm fairly confident the data.table merge will be faster than merge or a vector scan.
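One way to check that would be to time both joins with system.time. The sketch below is only an illustration under assumptions: N, the seed and the hand-built `m` table are chosen here and are not from the answer, and no particular timings are claimed.

```r
library(data.table)

set.seed(1)
N <- 1e5  # assumed size for a quick check; the question has about 2e6 rows
states <- c("a", "b")

# Long-form cost table keyed by a pasted "from to" string, as in the answer
m <- data.frame(transition = c("a a", "a b", "b a", "b b"),
                value = c("aa", "ab", "ba", "bb"),
                stringsAsFactors = FALSE)
transitions <- data.frame(from = sample(states, N, replace = TRUE),
                          to   = sample(states, N, replace = TRUE),
                          stringsAsFactors = FALSE)
transitions$transition <- paste(transitions$from, transitions$to)

# Base R merge
t.merge <- system.time(res.merge <- merge(m, transitions, by = "transition"))

# Keyed data.table join
m.dt <- as.data.table(m)
tr.dt <- as.data.table(transitions)
setkey(m.dt, transition)
setkey(tr.dt, transition)
t.dt <- system.time(res.dt <- m.dt[tr.dt])

# Elapsed seconds for each approach
print(c(merge = t.merge[["elapsed"]], data.table = t.dt[["elapsed"]]))
```

Both results should contain one row per transition, so comparing their row counts and values is a cheap sanity check before trusting the timings.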