I have a data frame that contains information about daily household trips.
df <- data.frame(
  hid = c("10001", "10001", "10001", "10001"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  mdc = c("000", "01010", "0", "02020"),
  thc1 = c(0, 0, 0, 0),
  thc2 = c(1, 1, NA, 2),
  thc3 = c(0, 0, NA, 0),
  thc4 = c(NA, 1, 0, 3),
  thc5 = c(NA, 0, NA, 0),
  mdc1 = c(0, 0, 0, 0),
  mdc2 = c(0, 1, NA, 2),
  mdc3 = c(0, 0, NA, 0),
  mdc4 = c(NA, 1, NA, 2),
  mdc5 = c(NA, 0, NA, 0)
)
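`hid`: household ID (the actual data frame contains other households as well)
`mid`: household member ID
`thc`: a character string representing the sequence of the member's daily movements; 0 = at home, 1 = unique ID of a place he/she visited. So if it is coded as `01020`, he/she went from home (0) to place `1`, returned home (0), went from home (0) to place `2`, and then returned home (0) within the day. The IDs in `thc` are split into the columns `thc1`, `thc2`, `thc3`, `thc4` and `thc5`. The number of `thc` columns is set by the longest movement sequence in the household. If one member's sequence has five codes and another member's has three, the other member's `thc4` and `thc5` are filled with `NA`.
`mdc`: this variable indicates the type of activity carried out at each place. For example, 1 = work, 2 = school. It is likewise split into the columns that follow (`mdc1` to `mdc5`).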
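To make the encoding concrete, here is a minimal sketch of how a code string could be split into per-step columns padded with `NA`. The helper name `split_codes` is hypothetical, and the sketch assumes every place ID is a single digit, as in the sample data.

```r
# Hypothetical helper: split a code string such as "01020" into single-digit
# steps and pad with NA up to `width` entries.
# Assumption: every place ID is a single digit (0-9), as in the sample data.
split_codes <- function(code, width) {
  steps <- as.numeric(strsplit(as.character(code), "")[[1]])
  c(steps, rep(NA, width - length(steps)))
}

thc <- c("010", "01010", "0", "02030")
width <- max(nchar(thc))  # longest movement sequence in the household

# One row per member, one column per step
steps <- t(sapply(thc, split_codes, width = width, USE.NAMES = FALSE))
colnames(steps) <- paste0("thc", seq_len(width))
steps
```

The same helper would apply to `mdc`, since both strings use one character per step.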
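What I want now is a list that contains an `adjacency matrix` and a `node list` for `network analysis`, i.e. with `igraph`.
With my current knowledge of lists and data transformation, converting `df` into the desired list is very complicated. For example, to create the `adjacency matrix`, I need to refer back and forth to the information in `thc` or `thc1-5`.
This is the desired result:
# Desired list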
[1] # It represents first element grouped by `hid`.
    # In the actual data frame, there are around 40,000
    # households which contains different `hid`.
$hid # `hid` of each record
[1]10001
[2]10001
[3]10001
[4]10001
$mid # `mid` of each record
[1]1
[2]2
[3]3
[4]4
$trip # `adjacency matrix` of each `mid`
      # head of line indicates destination area id
      # leftmost column indicates origin area id
      # for example of [1], `mid`=1 took 1 trip from 0 to 1 and 1 trip from 1 to 0
[1] # It represents `mid`=1
  0 1
0 0 1
1 1 0
[2] # It represents `mid`=2
  0 1
0 0 2
1 2 0
[3]
  0
0 0
[4]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0
$node # Attribute of each area defined in `mdc`
      # for instance, mdc of `mid`=4, that is `02020`, s/he had activity `2` twice
      # in area id `2` and `3` as indicated in `thc` and `thc1-4`.
      # The number does not indicate "how many times s/he took activity in the area"
      # but indicates "what s/he did in the area"
area mdc1 mdc2 mdc3 mdc4
   0    0    0    0    0
   1    0    1   NA   NA
   2   NA   NA   NA    2
   3   NA   NA   NA    2
[2] # Next element continues same information of other hid
    # Thus, from `hid` to `mdc` are one set of attributes of one element
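For the `node` table, I also need to obtain the maximum area ID and store the information from `mdc` or `mdc1-5`.
I would really appreciate any suggestions on how to start this work.
I prefer to use `tidyverse` and its family, but I have no experience using `purrr` for list operations. I have used the former for data manipulation but am not familiar with list operations.
After this step, I will visualize the movement and activity patterns of each household (not each member) with `igraph` or similar packages, to gain insights from each pattern.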
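Answer 0 (score: 1)
Here are two helper functions that build the adjacency matrix and the activity matrix:
## Build the adjacency matrix (details in the comments)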
build_adj_mat <- function(thc_) {
  # Split the code string into numeric area IDs.
  # as.character() makes this work for both factor and character input
  # (data.frame() no longer creates factors by default since R 4.0).
  thc_ <- as.numeric(unlist(strsplit(as.character(thc_), "")))
  # Create a matrix with the correct dimensions, and give names
  mat <- matrix(0, nrow = max(thc_) + 1, ncol = max(thc_) + 1)
  rownames(mat) <- colnames(mat) <- 0:max(thc_)
  # Add one to the cell of each consecutive origin-destination pair
  for (i in seq_len(length(thc_) - 1)) {
    from <- thc_[i] + 1
    to <- thc_[i + 1] + 1
    mat[from, to] <- mat[from, to] + 1
  }
  return(mat)
}
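The loop in `build_adj_mat` counts consecutive origin-destination pairs. As a rough alternative sketch, the same counts can be obtained with `table()` on two factors that share fixed levels. The function name `adj_from_code` is hypothetical, and single-digit area IDs with 0 as home are assumed.

```r
# Hypothetical alternative to the loop: count origin-destination pairs with table().
# Assumption: single-digit area IDs, with 0 meaning home.
adj_from_code <- function(code) {
  steps <- as.numeric(strsplit(as.character(code), "")[[1]])
  areas <- 0:max(steps)
  # Fixed factor levels keep areas that never occur as an origin (or as a
  # destination), so the result is always a square matrix.
  from <- factor(head(steps, -1), levels = areas)
  to <- factor(tail(steps, -1), levels = areas)
  counts <- table(from, to)
  # Return a plain numeric matrix: rows are origins, columns are destinations
  matrix(as.vector(counts), nrow = length(areas),
         dimnames = list(as.character(areas), as.character(areas)))
}

adj_from_code("02030")
```

Fixing the levels on both factors is the design choice that matters here: without it, `table()` would drop unvisited area IDs and the matrices of different members would no longer line up.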
## Build the activity matrix / node
build_node_df <- function(df_) {
  # Get the maximum area ID in the household
  max_len <- max(as.numeric(unlist(strsplit(as.character(df_$thc), ""))))
  # Build the activity vector for one member: the activity code at each area
  build_act_mat <- function(loc_, act_, max = max_len) {
    # Split the code strings into digits (works for factor or character input)
    loc_ <- as.numeric(unlist(strsplit(as.character(loc_), "")))
    act_ <- as.numeric(unlist(strsplit(as.character(act_), "")))
    area <- rep(NA, max + 1)
    for (i in seq_along(loc_)) {
      area[loc_[i] + 1] <- act_[i]
    }
    return(area)
  }
  # Call the function for every member (one column per member)
  out <- mapply(build_act_mat, df_$thc, df_$mdc)
  # cbind the output with the areas
  out <- data.frame(cbind(0:max_len, out))
  # Assign proper column names
  colnames(out) <- c("area", paste("mid_", df_$mid, sep = ""))
  return(out)
}
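To see what the inner `build_act_mat` step produces for one member, here is a small stand-alone sketch. The name `area_activity` and its `max_area` argument are hypothetical, and it assumes `thc` and `mdc` have the same number of characters.

```r
# Hypothetical stand-alone version of the per-member step.
# `max_area` is the largest area ID in the household.
# Assumption: `thc` and `mdc` have the same number of single-digit codes.
area_activity <- function(thc, mdc, max_area) {
  loc <- as.numeric(strsplit(as.character(thc), "")[[1]])
  act <- as.numeric(strsplit(as.character(mdc), "")[[1]])
  area <- rep(NA_real_, max_area + 1)
  # Index by area ID (+1 because R is 1-based); if an area is visited more
  # than once, the last visit's activity code is the one that is kept.
  area[loc + 1] <- act
  area
}

# Member 4: visits areas 2 and 3, doing activity 2 in both; area 1 is never visited
area_activity("02030", "02020", max_area = 3)
```

Areas the member never visited stay `NA`, which is what distinguishes "not visited" from activity code 0.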
Next, a function that applies these helpers to `df` and adds the `hid` and `mid` elements to the output:
build_list <- function(dfo) {
  hid_ <- as.numeric(as.character(dfo$hid))
  mid_ <- as.numeric(as.character(dfo$mid))
  trip_ <- lapply(dfo$thc, build_adj_mat)
  node_ <- build_node_df(dfo)
  return(list(
    hid = hid_,
    mid = mid_,
    trip = trip_,
    node = node_
  ))
}
Output:
> build_list(df)
$hid
[1] 10001 10001 10001 10001
$mid
[1] 1 2 3 4
$trip
$trip[[1]]
  0 1
0 0 1
1 1 0
$trip[[2]]
  0 1
0 0 2
1 2 0
$trip[[3]]
  0
0 0
$trip[[4]]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0
$node
  area mid_1 mid_2 mid_3 mid_4
1    0     0     0     0     0
2    1     0     1    NA    NA
3    2    NA    NA    NA     2
4    3    NA    NA    NA     2
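Since the question asks for household-level (not member-level) patterns, one possible next step is to add the member matrices in `$trip` into a single household matrix. The sketch below is only an illustration: `sum_trips` is a hypothetical helper, the two input matrices are typed in by hand to mirror `$trip[[1]]` and `$trip[[4]]`, and the `igraph` call runs only if that package is installed.

```r
# Hypothetical helper: add member-level trip matrices into one household matrix.
# Assumption: every matrix is square and covers area IDs 0, 1, ..., k in order.
sum_trips <- function(trips) {
  n <- max(vapply(trips, nrow, integer(1)))
  ids <- as.character(seq_len(n) - 1)
  total <- matrix(0, nrow = n, ncol = n, dimnames = list(ids, ids))
  for (m in trips) {
    k <- seq_len(nrow(m))
    # Smaller matrices are added into the top-left corner
    total[k, k] <- total[k, k] + m
  }
  total
}

# Hand-typed copies of the matrices for mid 1 ("010") and mid 4 ("02030")
m1 <- matrix(c(0, 1, 1, 0), nrow = 2, dimnames = list(c("0", "1"), c("0", "1")))
ids4 <- as.character(0:3)
m4 <- matrix(0, nrow = 4, ncol = 4, dimnames = list(ids4, ids4))
m4["0", "2"] <- 1
m4["2", "0"] <- 1
m4["0", "3"] <- 1
m4["3", "0"] <- 1

household <- sum_trips(list(m1, m4))
household

# Optional: turn the household matrix into a weighted directed graph
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_adjacency_matrix(household, mode = "directed", weighted = TRUE)
}
```

Padding into the top-left corner works only because all matrices index areas from 0 upward; with arbitrary area labels, matching on `dimnames` would be needed instead.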
I'm sure there is a way to get this working with `dplyr`, but it may be easier to just use `split` from base `R`. With this slightly modified data frame:
df2 <- data.frame(
hid = c("10001", "10002", "10002", "10003"),
mid = c(1, 2, 3, 4),
thc = c("010", "01010", "0", "02030"),
mdc = c("000", "01010", "0", "02020")
)
Now split the new data frame into a list and use `lapply` to apply the `build_list` function to each piece:
split_df2 <- split(df2, df2$hid)
names(split_df2) <- paste("hid_", names(split_df2), sep = "")
lapply(split_df2, build_list)
Output:
$hid_10001
$hid_10001$hid
[1] 10001
$hid_10001$mid
[1] 1
$hid_10001$trip
$hid_10001$trip[[1]]
  0 1
0 0 1
1 1 0
$hid_10001$node
  area mid_1
1    0     0
2    1     0
$hid_10002
$hid_10002$hid
[1] 10002 10002
$hid_10002$mid
[1] 2 3
$hid_10002$trip
$hid_10002$trip[[1]]
  0 1
0 0 2
1 2 0
...
...
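Because the question mentions a preference for `purrr`, here is a hedged sketch of the same split-and-apply pattern in isolation. `summarise_household` is a hypothetical stand-in for `build_list` so that the snippet runs on its own; the `purrr::map` line is executed only if `purrr` is installed.

```r
df2 <- data.frame(
  hid = c("10001", "10002", "10002", "10003"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for build_list(): members and largest area ID per household
summarise_household <- function(d) {
  list(
    mid = d$mid,
    max_area = max(as.numeric(unlist(strsplit(d$thc, ""))))
  )
}

# One data frame per household, named by hid
by_hid <- split(df2, df2$hid)

# Base R version
res_base <- lapply(by_hid, summarise_household)

# purrr version: map() also returns a list and keeps the names of by_hid
if (requireNamespace("purrr", quietly = TRUE)) {
  res_purrr <- purrr::map(by_hid, summarise_household)
}

res_base[["10002"]]
```

For this use, `lapply()` and `purrr::map()` are interchangeable, so swapping one for the other should not require changes to the function being applied.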
Hope this points you in the right direction!