I am trying to prepare a data frame to feed to networkD3's forceNetwork function.
Here is a sample of my data:
structure(list(Case.Number = c("127967", "127967", "127967",
"127967", "141330", "141330", "141330", "141330", "141240", "141240",
"141240"), Word = c("account", "want", "membership", "sort",
"unhappi", "vr", "info", "miss", "csrf", "unhappi", "dissatisfi"
)), .Names = c("Case.Number", "Word"), class = c("data.table",
"data.frame"), row.names = c(NA, -11L))
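For the words under each case number, I want to generate a data frame with two columns holding every possible (and unique) two-word combination, as shown below. The same combination must not appear twice (including in reversed order), and a word must not be paired with itself.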
127967 account want
127967 account membership
127967 account sort
127967 want membership
127967 want sort
141330 unhappi vr
141330 unhappi info...
excluding
141330 unhappi unhappi
I tried the following to get the combinations:
source <- c("remove")
target <- c("remove")
ID <- c("remove")
df <- data.frame(ID = c("remove"), source = c("remove"), target = c("remove"))
for(i in unique(tbl$Case.Number)){
for (r in grep(i, tbl$Case.Number)) {
if(r < max(grep(i, tbl$Case.Number))){
ID <- i
source <- tbl$Word[r]
target <- tbl$Word[r+1]
rbind(df, cbind(ID, source,target))
}
}
}
View(df)
But it doesn't work.
Is there a cleaner way?
Answer 0 (score: 2)
Self-join, then filter:
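library(data.table)
# dd is the question's sample data, as a data.table
setkey(dd, Case.Number)
# join dd to itself on Case.Number, then keep each unordered pair exactly once
dd[dd, allow.cartesian = TRUE][Word < i.Word]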
# Case.Number Word i.Word
# 1: 127967 account want
# 2: 127967 membership want
# 3: 127967 sort want
# 4: 127967 account membership
# 5: 127967 account sort
# 6: 127967 membership sort
# 7: 141240 csrf unhappi
# 8: 141240 dissatisfi unhappi
# 9: 141240 csrf dissatisfi
# 10: 141330 info unhappi
# 11: 141330 miss unhappi
# 12: 141330 unhappi vr
# 13: 141330 info vr
# 14: 141330 miss vr
# 15: 141330 info miss
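For readers without data.table, the same self-join-then-filter idea can be sketched in base R. This is only an illustration, not the answer's method: it swaps in `merge()` for the keyed join, the name `pairs` is made up here, and the data frame is rebuilt by hand from the question's sample.

```r
# Sample data from the question, rebuilt as a plain data.frame
dd <- data.frame(
  Case.Number = c(rep("127967", 4), rep("141330", 4), rep("141240", 3)),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# Self-join on Case.Number: every word is paired with every word of the
# same case, including itself and both orderings (columns Word.x, Word.y)
pairs <- merge(dd, dd, by = "Case.Number")

# Keeping only rows where Word.x sorts strictly before Word.y drops the
# self-pairs and one of the two orderings of every remaining pair
pairs <- pairs[pairs$Word.x < pairs$Word.y, ]
```

The strict `<` comparison does both jobs at once: equal words fail the test, and of the two orderings (a, b) and (b, a) only one can pass.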
Answer 1 (score: 1)
Updated
Using tidyr::expand
...
df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967 account
127967 want
127967 membership
127967 sort
141330 unhappi
141330 vr
141330 info
141330 miss
141240 csrf
141240 unhappi
141240 dissatisfi
")
library(dplyr)
library(tidyr)
df %>%
group_by(Case.Number) %>%
expand(Word, i.Word = Word) %>%
filter(Word < i.Word)
Here is a tidyverse way (simpler than the original version below, borrowing @Gregor's simple filter approach)...
df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967 account
127967 want
127967 membership
127967 sort
141330 unhappi
141330 vr
141330 info
141330 miss
141240 csrf
141240 unhappi
141240 dissatisfi
")
library(dplyr)
library(tidyr)
df %>%
group_by(Case.Number) %>%
mutate(i.Word = Word) %>%
complete(Word, i.Word) %>%
filter(Word < i.Word)
# A tibble: 15 x 3
# Groups: Case.Number [3]
Case.Number Word i.Word
<int> <chr> <chr>
1 127967 account membership
2 127967 account sort
3 127967 account want
4 127967 membership sort
5 127967 membership want
6 127967 sort want
7 141240 csrf dissatisfi
8 141240 csrf unhappi
9 141240 dissatisfi unhappi
10 141330 info miss
11 141330 info unhappi
12 141330 info vr
13 141330 miss unhappi
14 141330 miss vr
15 141330 unhappi vr
Here is a tidyverse way (if a little convoluted)...
df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967 account
127967 want
127967 membership
127967 sort
141330 unhappi
141330 vr
141330 info
141330 miss
141240 csrf
141240 unhappi
141240 dissatisfi
")
library(dplyr)
library(tidyr)
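as_tibble(df) %>%
group_by(Case.Number) %>%
# combn() builds every 2-word combination; t() puts one pair per row.
# Note: as_data_frame() is deprecated in recent tibble releases in favour of as_tibble().
mutate(Word = list(as_data_frame(t(combn(unlist(Word), 2))))) %>%
unique() %>%
unnest(Word)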
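It is easier to understand if you run the following commands in order and look at what each one does. combn expands your vector into all possible combinations.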
vec <- c("account", "want", "membership", "sort")
combn(vec, 2)
t(combn(vec, 2))
as_data_frame(t(combn(vec, 2)))
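As a rough base R sketch of the same combn idea applied per case, the block below splits the words by case number and binds the pairs back together. The names `by_case` and `res`, and the `source`/`target` column names (taken from the question's attempt), are illustrative choices rather than part of the answer above.

```r
vec <- c("account", "want", "membership", "sort")
m <- combn(vec, 2)   # character matrix: 2 rows, one column per pair
t(m)                 # transposed: one pair per row

# Sample data from the question, rebuilt as a plain data.frame
dd <- data.frame(
  Case.Number = c(rep("127967", 4), rep("141330", 4), rep("141240", 3)),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# One character vector of words per case number
by_case <- split(dd$Word, dd$Case.Number)

res <- do.call(rbind, lapply(names(by_case), function(id) {
  w <- unique(by_case[[id]])
  # combn(w, 2) fails when fewer than 2 words are available, so skip those cases
  if (length(w) < 2) return(NULL)
  p <- t(combn(w, 2))
  data.frame(Case.Number = id, source = p[, 1], target = p[, 2],
             stringsAsFactors = FALSE)
}))
```

Because combn never pairs an element with itself and emits each unordered pair once, no filtering step is needed afterwards.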