Question

Apologies if this is a poor question. This is my first time asking on StackOverflow.

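I have usage data for a set of applications, and I'm trying to turn it into a heatmap that shows how usage overlaps between users across apps. I'm having trouble getting the data into a format suitable for drawing the heatmap in corrplot (my preferred heatmap visualization package).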

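The data is formatted so that every possible combination of app usage is represented as one row (e.g. app1 only, app2 only, app1 + app2, app3, app1 + app3, app2 + app3, app1 + app2 + app3, etc.), together with the number of users who fall into that specific configuration of app usage (e.g. a user who has used app1 and app3 contributes 1 to that particular row).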

An example using app launch data:

df.start <- data.frame(appset = c("[app1]","[app2]","[app3]","[app1;app2]","[app2;app3]","[app1;app3]","[app1;app2;app3]"),
                       unique_users = c(1000, 400, 150, 300, 30, 130,10))

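I'd like to end up with the data in a form that has the following properties:
1) Each row and each column represents one app (like a correlation matrix), so for a set of 3 apps it should be a 3x3 matrix whose rows are 'app1', 'app2', 'app3' and whose columns are also 'app1', 'app2', 'app3'.
2) Each row is normalized by the total number of users of that row's app, so that the numbers represent the ratio column.app/row.app, telling us what percentage of users of the row app also use the column app (if it's easier to normalize by column, that's fine too).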
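To make the two normalization directions in point 2 concrete, here is a minimal sketch. It assumes the overlap counts have already been tallied into a square matrix; the name `counts` and the result names are illustrative only, and the numbers are the totals implied by the example data in this question.

```r
# Overlap counts: the diagonal holds each app's total users, the
# off-diagonal cells hold users who used both the row and the column app.
apps <- c("app1", "app2", "app3")
counts <- matrix(c(1440, 310, 140,
                    310, 740,  40,
                    140,  40, 320),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(apps, apps))

# Row-normalized: cell [r, c] = share of row-app users who also use column app.
row_share <- sweep(counts, 1, diag(counts), "/")

# Column-normalized: cell [r, c] = share of column-app users who also use row app.
col_share <- sweep(counts, 2, diag(counts), "/")

print(round(row_share, 3))
print(round(col_share, 3))
```

Because the count matrix is symmetric, the column-normalized result is simply the transpose of the row-normalized one, so either direction carries the same information.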

What I'm aiming for looks like this:

df.end <- data.frame(app1 = c(1, 310/1440, 140/1440),
                     app2 = c(310/740, 1, 40/740),
                     app3 = c(140/320, 40/320, 1))
row.names(df.end) <- c('app1','app2','app3')

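(I've written these numbers as ratios such as '300/1430' to show the calculation I want to perform on each row to normalize the data, but ultimately the value should display as .20979 in that instance; the way it appears in R after running that code is how I want it to appear.)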

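I'm not married to getting the data into exactly that format; ultimately I just need a way to visualize the cross-usage relationships between apps, and heatmaps have served me well for this purpose in the past. What I need is:
1) The apps in the data are detected automatically by name to generate the rows and columns of the matrix (because I have more than just the 3 example apps, and I want to rerun the code on various combinations of apps of interest for different purposes).
2) The numbers are expressed as ratios between apps, so that both directions are represented somewhere in the data (e.g. the proportion of app1 users who also use app2, as well as the proportion of app2 users who also use app1).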

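I've done the calculations for individual cells by hand (copying and pasting the results into the form I need), but that is obviously a poor approach for producing reproducible results and for applying to new datasets.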

To split the app sets out into columns, I started with:

library(dplyr)  # provides mutate()

# Flag, for each row, whether the appset contains each app.
df.start <- mutate(df.start,
                   app1 = grepl("app1", appset),
                   app2 = grepl("app2", appset),
                   app3 = grepl("app3", appset))

Finding the total number of unique users for each app (to normalize the rows later):

total_app1 <- sum(df.start$unique_users[df.start$app1])
total_app2 <- sum(df.start$unique_users[df.start$app2])
total_app3 <- sum(df.start$unique_users[df.start$app3])

Then generating the individual cells of the normalized data by hand, to copy and paste into Excel:

sum(df.start$unique_users[df.start$app1 & df.start$app1])/total_app1
sum(df.start$unique_users[df.start$app1 & df.start$app2])/total_app1
sum(df.start$unique_users[df.start$app1 & df.start$app3])/total_app1

sum(df.start$unique_users[df.start$app2 & df.start$app1])/total_app2
sum(df.start$unique_users[df.start$app2 & df.start$app2])/total_app2
sum(df.start$unique_users[df.start$app2 & df.start$app3])/total_app2

sum(df.start$unique_users[df.start$app3 & df.start$app1])/total_app3
sum(df.start$unique_users[df.start$app3 & df.start$app2])/total_app3
sum(df.start$unique_users[df.start$app3 & df.start$app3])/total_app3

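This is obviously not how it should be done if I want to automate the process for datasets containing other apps, but I wanted to include what I've been doing so far in case it helps explain what I've tried.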

Thanks in advance!

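Edit: I left an important detail out of the example data, namely that an app set can contain more than two apps (e.g. there is a row for users who used all three apps).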

Answer 1

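OK... after a long read, I think I understand what you want to do. This is mostly a data-cleaning problem, and the main task is to build the right matrix. Let's start from your df.start.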

require(stringr) #To handle the app names.
require(magrittr) #Pipe operator.

df.start$appset <- as.character(df.start$appset) %>% str_replace_all('\\[','') %>% str_replace_all('\\]','')
# Remove the annoying '[' and ']' first.

apps <- df.start$appset %>% str_split(';') %>% unlist() %>% unique()
# Get the names of all your apps.

apps.self <- paste(apps,apps,sep = ';')
df.start$appset[match(apps,df.start$appset)] <- apps.self
# Change 'app1' to 'app1;app1' format. 

appset.swap <- sapply(df.start$appset,function(x){paste(rev(unlist(str_split(x,';'))),collapse = ';')})
# Swap the app1;app2 to app2;app1. 

df.start <- rbind(df.start,data.frame(appset = appset.swap,unique_users = df.start$unique_users,row.names = NULL)) %>% unique()
# Assign values to the swapped appset, and merge with df.start. Now the dataframe looks much better.

df.start <- df.start[order(df.start$appset),]
mat <- matrix(df.start$unique_users,nrow = length(apps),ncol = length(apps))
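# Arrange your appset alphabetically, and make the matrix.
# Note: this step assumes every appset names at most two apps; rows with
# three or more apps would leave more values than the matrix has cells.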

mat <- sweep(mat,2,colSums(mat),'/')
diag(mat) <- 1
rownames(mat) <- apps
colnames(mat) <- apps
df.end <- as.data.frame(mat)
#Done.

I'm a little confused about why the diagonal should be 1. The information about single-app users will be lost.
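One possible way to keep that single-app information next to the normalized matrix is to compute, for each app, how many of its users appear only in the row for that app alone. This is a sketch under the assumption that the brackets have already been stripped from `appset`; the names `sets`, `uses`, `only`, `total`, and `exclusive` are hypothetical and not part of the code above.

```r
# Example data from the question, with '[' and ']' already removed.
df <- data.frame(
  appset = c("app1", "app2", "app3", "app1;app2",
             "app2;app3", "app1;app3", "app1;app2;app3"),
  unique_users = c(1000, 400, 150, 300, 30, 130, 10),
  stringsAsFactors = FALSE
)

# Split every appset into its individual app names.
sets <- strsplit(df$appset, ";", fixed = TRUE)
apps <- sort(unique(unlist(sets)))

# TRUE for rows whose appset contains the app / consists of the app alone.
uses <- function(app) vapply(sets, function(s) app %in% s, logical(1))
only <- function(app) vapply(sets, function(s) identical(s, app), logical(1))

total     <- sapply(apps, function(a) sum(df$unique_users[uses(a)]))  # 1440, 740, 320
exclusive <- sapply(apps, function(a) sum(df$unique_users[only(a)]))  # 1000, 400, 150

# Share of each app's users who used nothing else.
exclusive_share <- exclusive / total
print(round(exclusive_share, 3))
```

The resulting vector could be reported alongside the heatmap, leaving the diagonal of the normalized matrix free to be 1.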

Answer 2

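Borrowing heavily from Feng, but with some important changes, here is the code that accomplishes what I asked for in my question: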

library(tidyverse)
library(stringr)

# Starting with the data
df.start <- data.frame(appset = c("[app1]","[app2]","[app3]","[app1;app2]","[app2;app3]","[app1;app3]","[app1;app2;app3]"),
                       unique_users = c(1000, 400, 150, 300, 30, 130,10))

# Remove [ ] and " characters first
df.start$appset <- as.character(df.start$appset) %>%
                   str_replace_all('\\[','') %>%
                   str_replace_all('\\]','') %>%
                   str_replace_all('\"','')

# Get unique names of the apps and alphabetize
apps <- df.start$appset %>%
        str_split(';') %>%
        unlist() %>%
        unique() %>%
        sort(decreasing = FALSE)

# Calculate the matrix of overlapping usage
apps.mat <- sapply(apps, function(m) sapply(apps, function(n) sum(df.start$unique_users[grepl(m,df.start$appset) & grepl(n,df.start$appset)])))
# This is the first critical change needed - this approach deals with any
# number of possible apps and combinations of those apps (not just if they
# are initially reported in pairs).

# Normalize each row by diagonal (e.g. combined usage / total usage per app)  
apps.mat.norm <- sweep(apps.mat,1,diag(apps.mat),'/')
# Second critical change is switching the margin in sweep to 1 (rows) and 
# the stat to diag().  This way each row is normalized by the overlap
# of the apps usage to itself (i.e. total unique users in that app
# regardless of other app usage).  The diagonal should represent 100% 
# overlap between an app and itself.
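One caveat with the `grepl()` matching above: it matches substrings, so an app named "app1" would also match a set containing "app10". The names in the example data do not collide, so the code above is fine for them. The sketch below shows one possible alternative that compares whole app names after splitting; the data (including the app name "app10") and the names `membership` and `counts` are hypothetical and only there to illustrate the difference.

```r
# Hypothetical data in which one app name is a prefix of another.
df <- data.frame(
  appset = c("app1", "app10", "app1;app10"),
  unique_users = c(5, 7, 2),
  stringsAsFactors = FALSE
)

# Split each appset into whole app names.
sets <- strsplit(df$appset, ";", fixed = TRUE)
apps <- sort(unique(unlist(sets)))

# Numeric membership matrix: one row per appset, one column per app (1 = present).
membership <- 1 * sapply(apps, function(a) {
  vapply(sets, function(s) a %in% s, logical(1))
})

# counts[m, n] = users whose appset contains both app m and app n.
counts <- crossprod(membership, membership * df$unique_users)
print(counts)

# For comparison, substring matching also counts the "app10"-only users:
print(sum(df$unique_users[grepl("app1", df$appset)]))
```

With exact matching the total for "app1" is 7 users, whereas the substring version reports 14 because the "app10"-only row is counted as well.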

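I think some of the changes I needed to make were down to how I posed the question. I apologize for that, but thank you very much for the great help with some of the data-management problems I was running into!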
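For completeness, a sketch of how a normalized matrix like this might be handed to corrplot. It assumes the corrplot package is installed and falls back to base R's `heatmap()` if it is not; the matrix is rebuilt here from the example totals so the snippet runs on its own, and the names `overlap` and `shares` are illustrative. Setting `is.corr = FALSE` tells corrplot the values are general numbers (here, proportions between 0 and 1) rather than correlations between -1 and 1.

```r
apps <- c("app1", "app2", "app3")

# Overlap counts implied by the example data (diagonal = total users per app).
overlap <- matrix(c(1440, 310, 140,
                     310, 740,  40,
                     140,  40, 320),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(apps, apps))

# Row-normalize by the diagonal, as in the answer above.
shares <- sweep(overlap, 1, diag(overlap), "/")

if (requireNamespace("corrplot", quietly = TRUE)) {
  # The matrix is not symmetric, so the full matrix is plotted.
  corrplot::corrplot(shares, is.corr = FALSE, method = "color")
} else {
  # Base R fallback: no clustering or rescaling, just the raw proportions.
  heatmap(shares, Rowv = NA, Colv = NA, scale = "none")
}
```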

Using R to change a data frame into a suitable heatmap matrix
