Using R

Time: 2015-05-15 15:11:44

Tags: r matrix

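My data frame consists of individuals and the city they lived in at a given point in time. I would like to generate one origin-destination matrix per year, recording the number of moves from one city to another. I would like to know:

  1. How can I automatically generate an origin-destination table for each year in the data set?
  2. How can I generate all the tables in the same 5x5 format, 5 being the number of cities in my example?
  3. Is there more efficient code than what I propose below? I intend to run it on a very large data set.

Consider the following example: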

    #An example dataframe
    id=sample(1:5,50,T)
    year=sample(2005:2010,50,T)
    city=sample(paste(rep("City",5),1:5,sep=""),50,T)
    df=as.data.frame(cbind(id,year,city),stringsAsFactors=F)
    df$year=as.numeric(df$year)
    df=df[order(df$id,df$year),]
    rm(id,year,city)
    

My best attempt:

    #Creating variables
    for(i in 1:length(df$id)){
      df$origin[i]=df$city[i]
      df$destination[i]=df$city[i+1]
      df$move[i]=ifelse(df$orig[i]!=df$dest[i] & df$id[i]==df$id[i+1],1,0) #Checking whether a move has taken place and whether its the same person
      df$year_move[i]=ceiling((df$year[i]+df$year[i+1])/2) #I consider that the person has moved exactly between the two dates at which its location was recorded
    }
    df=df[df$move!=0,c("origin","destination","year_move")]    
    

Creating the origin-destination table for 2007:

    yr07=df[df$year_move==2007,]
    table(yr07$origin,yr07$destination)
    

Result:

            City1 City2 City3 City5
      City1     0     0     1     2
      City2     2     0     0     0
      City5     1     1     0     0
    

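2 answers:

Answer 0 (score: 6)

You can split the data by id, perform the necessary computations on each id-specific data frame to get all of that person's moves, and then recombine the results: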

    spl <- split(df, df$id)
    move.spl <- lapply(spl, function(x) {
      ret <- data.frame(from=head(x$city, -1), to=tail(x$city, -1),
                        year=ceiling((head(x$year, -1)+tail(x$year, -1))/2),
                        stringsAsFactors=FALSE)
      ret[ret$from != ret$to,]
    })
    (moves <- do.call(rbind, move.spl))
    #       from    to year
    # 1.1  City4 City2 2007
    # 1.2  City2 City1 2008
    # 1.3  City1 City5 2009
    # 1.4  City5 City4 2009
    # 1.5  City4 City2 2009
    # ...

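Because this code uses vectorized computations for each id, it should be much faster than looping over every row of the data frame as in the code you provided.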
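For a very large data set, the per-id `split`/`lapply` step can itself be avoided, which may help further. The following is a minimal sketch, not part of the original answer. It assumes `df` has the columns `id`, `year`, and `city` and is already sorted by `id` and then `year`, as in the question's example; `same_id`, `keep`, and `moves_vec` are illustrative names.

```r
# Example data in the same shape as the question's data frame
set.seed(1)
df <- data.frame(id   = sample(1:5, 50, TRUE),
                 year = sample(2005:2010, 50, TRUE),
                 city = paste0("City", sample(1:5, 50, TRUE)),
                 stringsAsFactors = FALSE)
df <- df[order(df$id, df$year), ]

n <- nrow(df)
from <- df$city[-n]                    # each row's city ...
to   <- df$city[-1]                    # ... paired with the next row's city
same_id <- df$id[-n] == df$id[-1]      # both rows belong to the same person
keep <- same_id & from != to           # a real move by one person

moves_vec <- data.frame(
  from = from[keep],
  to   = to[keep],
  # the move is dated halfway between the two observations, rounded up
  year = ceiling((df$year[-n] + df$year[-1]) / 2)[keep],
  stringsAsFactors = FALSE
)
```

Under those assumptions `moves_vec` holds the same moves, in the same order, as the `split`-based result, without building one data frame per person.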

Now you can use `split` and `table` to get the year-specific 5x5 move matrices:

    moves$from <- factor(moves$from)
    moves$to <- factor(moves$to)
    lapply(split(moves, moves$year), function(x) table(x$from, x$to))
    # $`2005`
    #        
    #         City1 City2 City3 City4 City5
    #   City1     0     0     0     0     1
    #   City2     0     0     0     0     0
    #   City3     0     0     0     0     0
    #   City4     0     0     0     0     0
    #   City5     0     0     1     0     0
    # 
    # $`2006`
    #        
    #         City1 City2 City3 City4 City5
    #   City1     0     0     0     1     0
    #   City2     0     0     0     0     0
    #   City3     1     0     0     1     0
    #   City4     0     0     0     0     0
    #   City5     2     0     0     0     0
    # ...
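One caveat: `factor()` without `levels` only keeps the values that actually occur, so a city that never appears as an origin (or never as a destination) in `moves` would drop out of every table. A small sketch of one way to guard against that, assuming the full list of cities is known up front; the toy `moves` data and the `cities` vector are made up for illustration:

```r
# Toy moves table (hypothetical data): City3 is never an origin
# and City1 is never a destination.
moves <- data.frame(from = c("City1", "City2", "City1"),
                    to   = c("City2", "City3", "City3"),
                    year = c(2005, 2005, 2006),
                    stringsAsFactors = FALSE)

# Assumed complete list of cities, e.g. sort(unique(df$city)) on the raw data
cities <- paste0("City", 1:3)

# Explicit levels force every table to have one row and column per city
moves$from <- factor(moves$from, levels = cities)
moves$to   <- factor(moves$to,   levels = cities)
tabs <- lapply(split(moves, moves$year), function(x) table(x$from, x$to))
```

With explicit levels every element of `tabs` has the same dimensions, even for a year in which some city has no moves at all.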

Answer 1 (score: 0)

You can do this with `dcast` from reshape2 and a loop.

    library(reshape2)

    # Print the origin-destination matrix for one year
    write_matrices <- function(year){
      # fun.aggregate = length counts the moves in each origin/destination cell
      mat <- dcast(subset(df, df$year_move == year), origin ~ destination,
                   fun.aggregate = length, value.var = "year_move")
      print(year)
      print(mat)
    }

    # get the unique list of years (there was an NA in there, which is why this is longer than it needs to be)
    years <- unique(subset(df, !is.na(df$year_move))$year_move)

    # loop through the years and print the results
    for (year in years){
      write_matrices(year)
    }

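The only thing left to solve is that every matrix must be 5 x 5: if not all 5 cities occur in a given year, only the cities from that year are shown.

You can solve this by first converting the observations to a frequency table, so that the missing combinations are included with a count of zero.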
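A minimal base R sketch of that idea, under the assumption that the complete list of cities is known; the toy `df` values and the `cities`, `freq`, and `mat_2007` names are illustrative only:

```r
# Toy data in the shape produced by the question's code (hypothetical values)
df <- data.frame(origin      = c("City1", "City2", "City1"),
                 destination = c("City2", "City3", "City3"),
                 year_move   = c(2006, 2007, 2007),
                 stringsAsFactors = FALSE)
cities <- paste0("City", 1:3)  # assumed complete list of cities

# Three-way frequency table in long form; the fixed factor levels keep
# origin/destination pairs that were never observed, with Freq = 0
freq <- as.data.frame(table(origin      = factor(df$origin, levels = cities),
                            destination = factor(df$destination, levels = cities),
                            year_move   = df$year_move))

# Square matrix for one year, rebuilt from the long frequency table
mat_2007 <- xtabs(Freq ~ origin + destination,
                  data = freq[freq$year_move == "2007", ])
```

Here `freq` has one row per origin, destination, and year combination, so each year's matrix comes out with the same dimensions. The same long table could presumably also be passed to `dcast(..., origin ~ destination, value.var = "Freq")` instead of `xtabs`.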