Generate two columns of word combinations from one column of words that share the same ID value

Time: 2018-01-17 18:27:03

Tags: r networkd3

I am trying to prepare a data frame to feed to the forceNetwork function.

Here is a sample of my data:

structure(list(Case.Number = c("127967", "127967", "127967", 
"127967", "141330", "141330", "141330", "141330", "141240", "141240", 
"141240"), Word = c("account", "want", "membership", "sort", 
"unhappi", "vr", "info", "miss", "csrf", "unhappi", "dissatisfi"
)), .Names = c("Case.Number", "Word"), class = c("data.table", 
"data.frame"), row.names = c(NA, -11L))

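For the words under each case number, I want to generate a data frame whose two word columns hold every possible (and unique) two-word combination, as shown below. No combination should be repeated within a case (including in reverse order), and no word should be combined with itself.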

127967 account want
127967 account membership
127967 account sort
127967 want    membership
127967 want    sort
141330 unhappi vr
141330 unhappi info...

excluding
141330 unhappi unhappi

I tried the following to get the combinations:

source <- c("remove")
target <- c("remove")
ID <- c("remove")
df <- data.frame(ID = c("remove"), source = c("remove"), target = c("remove"))

for(i in unique(tbl$Case.Number)){
  for (r in grep(i, tbl$Case.Number)) {
    if(r < max(grep(i, tbl$Case.Number))){
      ID <- i
      source <- tbl$Word[r]
      target <- tbl$Word[r+1]
      rbind(df, cbind(ID, source,target))
    }

  }

}

View(df) 

But it doesn't work.

Is there a cleaner way?

2 answers:

Answer 0: (score: 2)

Self-join, then filter:

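library(data.table)

# dd is the question's data.table (called tbl there), keyed by case number
setkey(dd, Case.Number)
# join dd to itself within each case, then keep one ordering of each pair
dd[dd, allow.cartesian = TRUE][Word < i.Word]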
#     Case.Number       Word     i.Word
#  1:      127967    account       want
#  2:      127967 membership       want
#  3:      127967       sort       want
#  4:      127967    account membership
#  5:      127967    account       sort
#  6:      127967 membership       sort
#  7:      141240       csrf    unhappi
#  8:      141240 dissatisfi    unhappi
#  9:      141240       csrf dissatisfi
# 10:      141330       info    unhappi
# 11:      141330       miss    unhappi
# 12:      141330    unhappi         vr
# 13:      141330       info         vr
# 14:      141330       miss         vr
# 15:      141330       info       miss
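The filter does two jobs at once: `x < x` is never true, so self-pairs drop out, and of the two orderings of any pair only the alphabetically sorted one survives. A minimal base R sketch of the same idea follows, assuming the words within each case are distinct. The data is rebuilt as a plain data.frame, and `Word.x` / `Word.y` are simply the default column names `merge()` produces, not names from the answer.

```r
# Sample data from the question, rebuilt as a plain data.frame.
dd <- data.frame(
  Case.Number = c(rep("127967", 4), rep("141330", 4), rep("141240", 3)),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# Self-join on Case.Number: every word meets every word of the same case.
pairs <- merge(dd, dd, by = "Case.Number")

# Keep only rows where the first word sorts before the second one.
# This removes self-pairs and one of the two orderings of each pair.
pairs <- pairs[pairs$Word.x < pairs$Word.y, ]
rownames(pairs) <- NULL

print(pairs)
```

If a case could contain the same word twice, running `unique(dd)` before the join would avoid duplicated pairs.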

Answer 1: (score: 1)

Updated

Using tidyr::expand ...

df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967    account
127967       want
127967 membership
127967       sort
141330    unhappi
141330         vr
141330       info
141330       miss
141240       csrf
141240    unhappi
141240 dissatisfi
")

library(dplyr)
library(tidyr)

df %>% 
  group_by(Case.Number) %>% 
  expand(Word, i.Word = Word) %>% 
  filter(Word < i.Word)

Here is a tidyverse way (more complex than the original version below, using @Gregor's simple filtering approach)...

df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967    account
127967       want
127967 membership
127967       sort
141330    unhappi
141330         vr
141330       info
141330       miss
141240       csrf
141240    unhappi
141240 dissatisfi
")

library(dplyr)
library(tidyr)

df %>% 
  group_by(Case.Number) %>% 
  mutate(i.Word = Word) %>% 
  complete(Word, i.Word) %>% 
  filter(Word < i.Word)

# A tibble: 15 x 3
# Groups: Case.Number [3]
   Case.Number Word       i.Word    
         <int> <chr>      <chr>     
 1      127967 account    membership
 2      127967 account    sort      
 3      127967 account    want      
 4      127967 membership sort      
 5      127967 membership want      
 6      127967 sort       want      
 7      141240 csrf       dissatisfi
 8      141240 csrf       unhappi   
 9      141240 dissatisfi unhappi   
10      141330 info       miss      
11      141330 info       unhappi   
12      141330 info       vr        
13      141330 miss       unhappi   
14      141330 miss       vr        
15      141330 unhappi    vr
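For the forceNetwork step the question mentions, a pair table like the one above still needs to be turned into a node table and a link table whose source and target are zero-based positions in the node table. The sketch below shows one way to do that in base R. The names `pairs`, `nodes`, `links`, `name` and `group`, and the single placeholder group, are assumptions made for illustration, and only a few hand-typed rows are used.

```r
# A few pairs in the shape of the result above (typed in by hand for the sketch).
pairs <- data.frame(
  Case.Number = c(127967L, 127967L, 141240L, 141330L),
  Word   = c("account", "account", "csrf", "unhappi"),
  i.Word = c("membership", "sort", "unhappi", "vr"),
  stringsAsFactors = FALSE
)

# Node table: one row per distinct word. "unhappi" occurs in two cases
# but becomes a single node. `group` is a placeholder (assumption).
nodes <- data.frame(name = unique(c(pairs$Word, pairs$i.Word)),
                    stringsAsFactors = FALSE)
nodes$group <- 1L

# Link table: position of each word in `nodes`, shifted so indices start at 0.
links <- data.frame(
  source = match(pairs$Word, nodes$name) - 1L,
  target = match(pairs$i.Word, nodes$name) - 1L
)

print(nodes)
print(links)

# Not run here (requires the networkD3 package):
# networkD3::forceNetwork(Links = links, Nodes = nodes,
#                         Source = "source", Target = "target",
#                         NodeID = "name", Group = "group")
```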

Here is a tidyverse way (if a somewhat complicated one)...

df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967    account
127967       want
127967 membership
127967       sort
141330    unhappi
141330         vr
141330       info
141330       miss
141240       csrf
141240    unhappi
141240 dissatisfi
")

library(dplyr)
library(tidyr)

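as_tibble(df) %>% 
  group_by(Case.Number) %>% 
  # as_tibble() replaces the deprecated as_data_frame(); the pair matrix has
  # no column names, so its two columns are named explicitly
  mutate(Word = list(as_tibble(t(combn(unlist(Word), 2)), .name_repair = ~ c("V1", "V2")))) %>% 
  unique() %>% 
  unnest(Word)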

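It is easier to understand if you run the following commands one after another to see what each of them does. combn expands your vector into all possible combinations of the requested size (here, pairs).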

vec <- c("account", "want", "membership", "sort")
combn(vec, 2)
t(combn(vec, 2))
as_tibble(t(combn(vec, 2)), .name_repair = ~ c("V1", "V2"))
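The same combn step can also be applied per case without any packages. The following is a rough base R sketch, assuming every case has at least two words (combn fails otherwise). The column names `source` and `target` are borrowed from the question's attempt, and `combos` is a name made up for this example.

```r
# Sample data from the question as a plain data.frame.
dd <- data.frame(
  Case.Number = c(rep("127967", 4), rep("141330", 4), rep("141240", 3)),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# For each case: combn() returns a matrix with one pair per column,
# so transpose it to get one pair per row.
combos <- do.call(rbind, lapply(split(dd, dd$Case.Number), function(g) {
  m <- t(combn(g$Word, 2))
  data.frame(Case.Number = g$Case.Number[1],
             source = m[, 1],
             target = m[, 2],
             stringsAsFactors = FALSE)
}))
rownames(combos) <- NULL

print(combos)
```

Unlike the `Word < i.Word` filter, combn keeps the words in their original order within each pair, so the first pair here is account / want rather than an alphabetically sorted one.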