How to convert columns into matrices and tables stored in a list, under complex conditions [R]

Time: 2018-07-12 15:47:50

Tags: r dplyr igraph data-conversion purrr

I have a data frame that contains information on the daily trips of household members.

df <- data.frame(
hid=c("10001","10001","10001","10001"),
mid=c(1,2,3,4),
thc=c("010","01010","0","02030"),
mdc=c("000","01010","0","02020"),
thc1=c(0,0,0,0),
thc2=c(1,1,NA,2),
thc3=c(0,0,NA,0),
thc4=c(NA,1,0,3),
thc5=c(NA,0,NA,0),
mdc1=c(0,0,0,0),
mdc2=c(0,1,NA,2),
mdc3=c(0,0,NA,0),
mdc4=c(NA,1,NA,2),
mdc5=c(NA,0,NA,0)
)

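hid: household ID (the actual data frame contains other households as well)
mid: household member ID
thc: a string representing the sequence of the member's daily movements;
0 = at home, 1 = the unique ID of a place s/he went to

Thus, if it is coded as 01020, s/he went from home (0) to place 1, came back home (0), went from home (0) to place 2, and came back home (0), all within one day.

thc1-5: thc split into one column per position. The number of these columns is set by the longest movement sequence in the household.
If the code of one member has 5 digits and the codes of the other members have 3, thc4 and thc5 of the other members are filled with NA.

mdc: this variable indicates the attribute of the activity carried out at each place. For example, 1 = work, 2 = school. It is also split, into the later columns mdc1-5.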
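
To make the encoding concrete, here is a minimal sketch of how one code string maps to the padded thc1-5 columns and to a sequence of trips. The names `code`, `n_cols`, `padded` and `trips` are assumptions made for illustration and do not appear in the question.

```r
# Illustrative only: `code` and `n_cols` are assumed names, not from the question
code <- "010"
n_cols <- 5  # number of digits in the longest code of the household

# Split the string into single digits
stops <- as.integer(strsplit(code, "")[[1]])

# Pad to n_cols positions, like thc1..thc5 in the data frame
padded <- c(stops, rep(NA_integer_, n_cols - length(stops)))
names(padded) <- paste0("thc", seq_len(n_cols))

# Consecutive stops form the trips (origin -> destination)
trips <- data.frame(from = head(stops, -1), to = tail(stops, -1))
```

Each row of `trips` is one movement, which is the information an adjacency matrix counts.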

Now, what I want to obtain is a list that contains an adjacency matrix for network analysis with igraph, and a node list that contains the information in mdc.

Here is the desired outcome:

# Desired list
[1] # It represents the first element grouped by `hid`.
    # In the actual data frame, there are around 40,000
    # households, each with a different `hid`.
$hid # `hid` of each record
[1]10001
[2]10001
[3]10001
[4]10001
$mid # `mid` of each record
[1]1
[2]2
[3]3
[4]4
$trip # `adjacency matrix` of each `mid`
      # the header row indicates the destination area id
      # the leftmost column indicates the origin area id
      # for example, in [1], `mid`=1 took 1 trip from 0 to 1 and 1 trip from 1 to 0
[1] # It represents `mid`=1
  0 1
0 0 1
1 1 0
[2] # It represents `mid`=2
  0 1
0 0 2
1 2 0
[3]
  0
0 0
[4]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0
$node # Attribute of each area defined in `mdc`
      # for instance, the mdc of `mid`=4 is `02020`: s/he had activity `2` twice,
      # in area ids `2` and `3`, as indicated in `thc` and `thc1-4`.
      # The number does not indicate "how many times s/he took an activity in the area"
      # but indicates "what s/he did in the area"
area mdc1 mdc2 mdc3 mdc4
0 0 0 0 0
1 0 1 NA NA
2 NA NA NA 2
3 NA NA NA 2
[2] # The next element holds the same information for another hid
    # Thus, `hid` through `node` form one set of attributes of one element

Based on my current knowledge of lists and data conversion, converting df into the desired list is very complicated. For example, to create the adjacency matrix, I need to refer back and forth to the information in thc or thc1-5. For node, I also need to obtain the largest area ID and store the information from mdc or mdc1-5.
I would greatly appreciate any suggestions for getting started on this.

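I would prefer to use tidyverse and its family, such as purrr, but I have not used purrr for list operations. I have used the tidyverse for data manipulation before, but I am not familiar with list manipulation.

After doing this, I am going to visualize the movement and activity patterns of each household (not each member) in igraph or other packages, in order to study each pattern.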

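1 answer:

Answer 0: (score: 1)

Here are two helper functions that build the adjacency matrix and the activity matrix:

## Build the adjacency matrix (details in the comments)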

build_adj_mat <- function(thc_) {
  # Split the code into single digits; as.character() handles both
  # factor and character input (strings are not factors by default in R >= 4.0)
  thc_ <- as.numeric(unlist(strsplit(as.character(thc_), "")))

  # Create a matrix with the correct dimensions, and give names
  # (areas are numbered from 0, so there are max + 1 rows and columns)
  mat <- matrix(0, nrow = max(thc_) + 1, ncol = max(thc_) + 1)
  rownames(mat) <- colnames(mat) <- 0:max(thc_)

  # Count one trip for each consecutive pair of areas;
  # seq_len() skips the loop when the code has a single digit
  for (i in seq_len(length(thc_) - 1)) {
    from <- thc_[i] + 1
    to <- thc_[i + 1] + 1
    mat[from, to] <- mat[from, to] + 1
  }
  return(mat)
}
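
As a cross-check on the loop, the same counts can probably be obtained without explicit iteration by cross-tabulating origins against destinations with `table()`. This is only a sketch of an alternative, not the answer's method; `adj_from_code` is a hypothetical name introduced here.

```r
# Hypothetical helper (not part of the answer): adjacency counts via table()
adj_from_code <- function(code) {
  stops <- as.integer(strsplit(as.character(code), "")[[1]])
  areas <- 0:max(stops)

  # Origins are all stops but the last, destinations all stops but the first.
  # Fixing the factor levels keeps areas that were never visited in the matrix.
  from <- factor(head(stops, -1), levels = areas)
  to <- factor(tail(stops, -1), levels = areas)

  # Drop the "table" class to get a plain integer matrix
  unclass(table(from, to))
}

m <- adj_from_code("02030")
```

Here `m["0", "2"]` holds the number of trips from area 0 to area 2. A single-digit code such as "0" yields a 1 x 1 matrix of zero.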


## Build the activity matrix / node

build_node_df <- function(df_) {
  # get the maximum area id in the household
  max_len <-
    max(as.numeric(unlist(strsplit(
      as.character(df_$thc), ""
    ))))
  # Build one column of the activity matrix (one household member)
  build_act_mat <- function(loc_, act_, max = max_len) {
    # as.character() handles both factor and character input
    loc_ <- as.numeric(unlist(strsplit(as.character(loc_), "")))
    act_ <- as.numeric(unlist(strsplit(as.character(act_), "")))
    area <- rep(NA, max + 1)
    # A later visit to the same area overwrites the earlier activity
    for (i in seq_along(loc_)) {
      area[loc_[i] + 1] <- act_[i]
    }
    return(area)
  }
  # Call the function for each member; SIMPLIFY = FALSE keeps one vector per
  # member, so households whose only area is 0 are handled as well
  cols <- mapply(build_act_mat, as.character(df_$thc), as.character(df_$mdc),
                 SIMPLIFY = FALSE, USE.NAMES = FALSE)
  # cbind the output with the areas
  out <- data.frame(cbind(0:max_len, do.call(cbind, cols)))
  # Assign proper column names
  colnames(out) <- c("area", paste("mid_", df_$mid, sep = ""))
  return(out)
}
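
The core of the inner function is a lookup from area ID to activity code. A minimal sketch of that step for a single member, written with vectorised assignment instead of a loop, might look like this. The variable names are assumptions for illustration, and the values correspond to `mid` = 4 in the example data.

```r
# Illustrative values for one member: thc = "02030", mdc = "02020"
loc <- c(0, 2, 0, 3, 0)   # area visited at each step
act <- c(0, 2, 0, 2, 0)   # activity carried out at each step
max_area <- 3             # largest area id in the household (assumed known)

# One slot per area id 0..max_area; areas never visited stay NA
area <- rep(NA_real_, max_area + 1)

# Area ids start at 0 while R indexes from 1, hence the + 1.
# When an area is visited more than once, the last activity wins.
area[loc + 1] <- act
```

The result has one entry per area, which becomes one column of the node table.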

Then a function that applies these functions to df and adds the hid and mid elements to the output:

build_list <- function(dfo) {
  hid_ <- as.numeric(as.character(dfo$hid))
  mid_ <- as.numeric(as.character(dfo$mid))
  trip_ <- lapply(dfo$thc, build_adj_mat)
  node_ <- build_node_df(dfo)

  return(list(
    hid = hid_,
    mid = mid_,
    trip = trip_,
    node = node_)
    )
}

Output:

> build_list(df)
$hid
[1] 10001 10001 10001 10001

$mid
[1] 1 2 3 4

$trip
$trip[[1]]
  0 1
0 0 1
1 1 0

$trip[[2]]
  0 1
0 0 2
1 2 0

$trip[[3]]
  0
0 0

$trip[[4]]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0


$node
  area mid_1 mid_2 mid_3 mid_4
1    0     0     0     0     0
2    1     0     1    NA    NA
3    2    NA    NA    NA     2
4    3    NA    NA    NA     2
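
Since the question asks for household-level patterns, one possible next step is to add the member matrices together. The sketch below assumes that an area ID refers to the same place for every member of a household, which the question does not state explicitly. `pad_matrix` and `household_trips` are hypothetical helpers, not part of the answer.

```r
# Hypothetical helpers, not part of the answer.
# Assumption: area ids mean the same place for every member of a household.

# Embed a small matrix in the top-left corner of an n x n zero matrix
pad_matrix <- function(m, n) {
  ids <- as.character(0:(n - 1))
  out <- matrix(0, nrow = n, ncol = n, dimnames = list(ids, ids))
  out[seq_len(nrow(m)), seq_len(ncol(m))] <- m
  out
}

# Add up the member matrices of one household
household_trips <- function(trips) {
  n <- max(vapply(trips, nrow, integer(1)))
  Reduce(`+`, lapply(trips, pad_matrix, n = n))
}

# The four member matrices from the example household, typed in by hand
trips <- list(
  matrix(c(0, 1, 1, 0), nrow = 2),
  matrix(c(0, 2, 2, 0), nrow = 2),
  matrix(0, nrow = 1, ncol = 1),
  matrix(c(0, 0, 1, 1,
           0, 0, 0, 0,
           1, 0, 0, 0,
           1, 0, 0, 0), nrow = 4, byrow = TRUE)
)
total <- household_trips(trips)
```

Padding every matrix to the size of the largest one is what makes the element-wise sum well defined.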

I'm sure there is a way to make this work with dplyr, but it may be easier to just use split from base R. With this slightly modified data frame:

df2 <- data.frame(
  hid = c("10001", "10002", "10002", "10003"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  mdc = c("000", "01010", "0", "02020")
)

Now split the new data frame into a list, and use lapply to apply the build_list function to each piece:

split_df2 <- split(df2, df2$hid)
names(split_df2) <- paste("hid_", names(split_df2), sep = "")
lapply(split_df2, build_list)
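
To see what `split` hands to each call, it may help to inspect the pieces before applying any function to them. This is a small self-contained check that repeats the modified data frame; `pieces` and `n_members` are names chosen here for illustration.

```r
df2 <- data.frame(
  hid = c("10001", "10002", "10002", "10003"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  mdc = c("000", "01010", "0", "02020")
)

# One data frame per household, named by the household id
pieces <- split(df2, df2$hid)

# Number of members (rows) in each household
n_members <- vapply(pieces, nrow, integer(1))
```

Each element of `pieces` is an ordinary data frame holding the rows of one household, so any function written for a single household can be applied with `lapply`.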

Output:

$hid_10001
$hid_10001$hid
[1] 10001

$hid_10001$mid
[1] 1

$hid_10001$trip
$hid_10001$trip[[1]]
  0 1
0 0 1
1 1 0


$hid_10001$node
  area mid_1
1    0     0
2    1     0


$hid_10002
$hid_10002$hid
[1] 10002 10002

$hid_10002$mid
[1] 2 3

$hid_10002$trip
$hid_10002$trip[[1]]
  0 1
0 0 2
1 2 0
...
...

I hope this points you in the right direction!