Question

I currently have a data frame representing a social network, as shown below:

id age  id1    id2   id3   
01  14  02      05    03        
02  23  01      05    03        
03  52  04      01    02        
04  41  03                      
05  32  01      02

Ideally, I would like a new data frame like this:

id age  id1    id2   id3   Connections
01  14  02      05    03        3
02  23  01      05    03        3
03  52  04      01    02        3
04  41  03                      1
05  32  01      02              2

with a new variable representing the number of connections each "id" has. So far, my code is as follows:

links <- df
links <- as.matrix(links)
links <- as.data.frame(rbind(links[,c(1,3)], links[,c(1,4)]), links[,c(1,5)])
head(links)

library(igraph)
g = graph.data.frame(links)
m = as.matrix(get.adjacency(g))
m
pmax(rowSums(m), colSums(m))

Which gives me:

 1  2  3  4  5 NA 
 3  3  3  1  2  3

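How can I then merge this into the data frame to create the "Connections" variable? My other data contains up to 50 connection columns, so ideally I would like a simple way that does not require recreating the data frame.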

Answer 1

library(dplyr)
# Toy data
df = data.frame(id = c(1,2,3,4), 
                age = c(1, 1, 1, 1), 
                id1 = c(1, 2, 3, 4), 
                id2 = c(1, 2, 3, NA), 
                id3 = c(1,2, NA, NA))

df$Connections = df %>%
  select(-id, -age) %>% # Remove unnecessary columns
  apply(1, function(row) {
    binary_row = as.numeric(!is.na(row)) # Convert each column to binary
    sum(binary_row) # Return connection count
  })

Answer 2

A quick tidyverse approach is to reshape the data to long format, count how many non-NA values each ID has, and then reshape back to wide.

library(tidyverse)

df %>%
  gather(key = key, value = val, -id, -age) %>%
  group_by(id, age) %>%
  mutate(connections = sum(!is.na(val))) %>%
  head()
#> # A tibble: 6 x 5
#> # Groups:   id, age [5]
#>   id      age key   val   connections
#>   <chr> <dbl> <chr> <chr>       <int>
#> 1 01       14 id1   02              3
#> 2 02       23 id1   01              3
#> 3 03       52 id1   04              3
#> 4 04       41 id1   03              1
#> 5 05       32 id1   01              2
#> 6 01       14 id2   05              3

df %>%
  gather(key = key, value = val, -id, -age) %>%
  group_by(id, age) %>%
  mutate(connections = sum(!is.na(val))) %>%
  spread(key = key, value = val)
#> # A tibble: 5 x 6
#> # Groups:   id, age [5]
#>   id      age connections id1   id2   id3  
#>   <chr> <dbl>       <int> <chr> <chr> <chr>
#> 1 01       14           3 02    05    03   
#> 2 02       23           3 01    05    03   
#> 3 03       52           3 04    01    02   
#> 4 04       41           1 03    <NA>  <NA> 
#> 5 05       32           2 01    02    <NA>

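That said, I wouldn't call your first approach wrong. Since you are working with a network, it makes sense to use network analysis tools and compute the degree of each node (which is the same as its number of connections).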
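As a rough sketch of that degree-based route, the code below builds an undirected edge list from every connection column, computes each node's degree with igraph, and writes it back onto the original data frame by id. The data frame is retyped from the question with NA for the blank cells; the names link_cols, edges and deg are illustrative, and the sketch assumes every referenced id also appears in the id column.

```r
library(igraph)

# Data retyped from the question; blank cells are assumed to be NA
df <- data.frame(
  id  = c("01", "02", "03", "04", "05"),
  age = c(14, 23, 52, 41, 32),
  id1 = c("02", "01", "04", "03", "01"),
  id2 = c("05", "05", "01", NA, "02"),
  id3 = c("03", "03", "02", NA, NA),
  stringsAsFactors = FALSE
)

# Connection columns are assumed to be named "id" followed by digits
link_cols <- grep("^id[0-9]+$", names(df), value = TRUE)

# Stack one (from, to) pair per connection column, then drop missing links
edges <- do.call(rbind, lapply(link_cols, function(col) {
  data.frame(from = df$id, to = df[[col]], stringsAsFactors = FALSE)
}))
edges <- edges[!is.na(edges$to), ]

# Passing the ids as vertices keeps nodes that have no links at all;
# simplify() collapses reciprocal pairs (01-02 and 02-01) into one edge
g <- igraph::simplify(
  igraph::graph_from_data_frame(edges, directed = FALSE, vertices = df["id"])
)

# degree() returns a vector named by vertex, so it can be indexed by id
deg <- igraph::degree(g)
df$Connections <- unname(deg[df$id])
df
```

Because the lookup is done by id, this adds the column in place and scales to any number of connection columns without rebuilding the data frame.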

Answer 3

What about something like this?

First, use a regex to identify the columns that correspond to connections

# here connections columns must contain the pattern "id"+digit(s)
connectionsNames <- grepl("id\\d+", names(df), perl = TRUE)

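Then we use rowSums to create the new column

# count non-NA values in the connection columns only, so NAs in other columns (e.g. age) are ignored
df$connections <- rowSums(!is.na(df[, connectionsNames, drop = FALSE]))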

Here is the result

df
  id age id1 id2 id3 connections
1  1   1   1   1   1           3
2  2   1   2   2   2           3
3  3   1   3   3  NA           2
4  4   1   4  NA  NA           1
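The result above comes from the toy data. A minimal sketch of the same idea on the question's data follows, under the assumption that the blank cells were read in as empty strings rather than NA (which depends on how the file was imported). The names link_cols and is_missing are illustrative.

```r
# Data retyped from the question; blank cells are assumed to be ""
df <- data.frame(
  id  = c("01", "02", "03", "04", "05"),
  age = c(14, 23, 52, 41, 32),
  id1 = c("02", "01", "04", "03", "01"),
  id2 = c("05", "05", "01", "", "02"),
  id3 = c("03", "03", "02", "", ""),
  stringsAsFactors = FALSE
)

# Logical mask of connection columns: "id" followed by one or more digits
link_cols <- grepl("^id[0-9]+$", names(df))
links <- df[, link_cols, drop = FALSE]

# Treat both NA and "" as "no connection"
is_missing <- is.na(links) | links == ""

# Count the remaining cells per row
df$Connections <- rowSums(!is_missing)
df
```

Since the columns are selected by pattern, the same lines should work unchanged with 50 connection columns.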

In R, how do I find the number of connections in a given data frame and create a variable representing it?
