Counting bigrams independently of order of appearance

Time: 2019-08-27 20:15:08

Tags: r text-mining

I am trying to count bigrams regardless of word order, so that, for example, "John Doe" and "Doe John" together count as 2.

I have tried some text-mining examples, such as the ones at https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html, but could not find any that count pairs while ignoring the order of appearance.

library(dplyr)  # provides %>%
library(widyr)
# austen_section_words is built earlier in the linked chapter
word_pairs <- austen_section_words %>%
  pairwise_count(word, section, sort = TRUE)
word_pairs

This counts the two orderings separately:

   item1     item2     n
   <chr>     <chr>     <dbl>
 1 darcy     elizabeth 144
 2 elizabeth darcy     144

It should look like this instead:

   item1     item2     n
   <chr>     <chr>     <dbl>
 1 darcy     elizabeth 288

Thanks to anyone who can help.

2 answers:

Answer 0: (score: 0)

This code works, though there may be more efficient ways to do it.

# Create sample dataframe
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'))

# Break out first and last names
library(stringr)
df$first <- word(df$name, 1)
df$second <- word(df$name, 2)

# Reorder alphabetically: a gets the smaller word, b the larger
df$a <- ifelse(df$first < df$second, df$first, df$second)
df$b <- ifelse(df$first > df$second, df$first, df$second)

library(dplyr)
summarize(group_by(df, a, b), n())

# Yields
#  a     b         `n()`
#  <chr> <chr>     <int>
#1 darcy elizabeth     2
#2 Doe   John          2
#3 Smith Steve         1
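
The reordering step above can also be written with base R's element-wise `pmin()` and `pmax()`, which work on character vectors. The following is only a sketch of that variant: the `pairs` data frame and its column names are made up for illustration, and string comparison in R depends on the locale, so mixed-case data may be ordered differently on other systems.

```r
library(dplyr)

# Hypothetical sample data: the two words of each bigram in separate columns
pairs <- data.frame(
  first  = c("darcy", "elizabeth", "John", "Doe", "Steve"),
  second = c("elizabeth", "darcy", "Doe", "John", "Smith"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() pick the smaller/larger word row by row, so both orderings
# of a pair end up with the same (a, b) key before counting
canonical <- pairs %>%
  mutate(a = pmin(first, second), b = pmax(first, second)) %>%
  count(a, b)

canonical
```

Each unordered pair then appears once in `canonical`, with `n` holding the number of rows that contained it in either order.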

Answer 1: (score: 0)

Thanks, guys,

I took your suggestions into account and tried something like this:

library(dplyr)
#Function to check whether 2 values are already in alphabetical order.
#I got this function from another post; I can't remember the author.
alphabetical <- function(x,y){x < y}

#Created a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")

dfSample<-data.frame(col1,col2)

#Create an empty dataframe
dfCreated <- data.frame(col1=character(),col2=character())

#For each row, reorder the two columns and append to the new dataframe
#Thanks to Gregor
for(i in seq_len(nrow(dfSample))) {

  # as.character() is base R; as.String() would need the NLP package
  row <- c(as.character(dfSample[i,1]), as.character(dfSample[i,2]))

  if(!alphabetical(row[1],row[2])){
    row <- c(row[2],row[1])
  }

  dfCreated<-rbind(dfCreated,c(row[1],row[2]),stringsAsFactors=FALSE)

}
colnames(dfCreated)<-c("col1","col2")

dfCreated

#Thanks to Monk
summarize(group_by(dfCreated, col1, col2), n())

  col1  col2      `n()`
  <chr> <chr>     <int>
1 darcy elizabeth     4
2 doe   john          2
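
For a table that already carries counts, like the `pairwise_count()` output in the question, the same canonical ordering can be combined with a weighted count instead of a row-by-row loop. This is a sketch under assumptions: `word_pairs` below is a hand-made stand-in for the real output, and `a`/`b` are temporary helper columns introduced only for this example.

```r
library(dplyr)

# Hypothetical stand-in for the pairwise_count() result shown in the question
word_pairs <- data.frame(
  item1 = c("darcy", "elizabeth"),
  item2 = c("elizabeth", "darcy"),
  n     = c(144, 144),
  stringsAsFactors = FALSE
)

# Put each pair in a canonical order, then sum the existing counts per pair
collapsed <- word_pairs %>%
  mutate(a = pmin(item1, item2), b = pmax(item1, item2)) %>%
  count(a, b, wt = n) %>%
  rename(item1 = a, item2 = b)

collapsed
```

Whether summing is the right aggregation depends on what `n` represents. If the two mirrored rows describe the same co-occurrences, keeping one row per pair may be more appropriate than adding them; `pairwise_count()` has an `upper = FALSE` argument intended for that, though it is worth checking against the widyr documentation for the installed version.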