我正在尝试独立计算二元组,例如“ John Doe”和“ Doe John”应合计为2。
已经尝试了使用文本挖掘的一些示例,例如https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html上提供的示例,但是找不到任何忽略出现顺序的计数。
library('widyr')
word_pairs <- austen_section_words %>%
pairwise_count(word, section, sort = TRUE)
word_pairs
像这样分开计数:
<chr> <chr> <dbl>
1 darcy elizabeth 144
2 elizabeth darcy 144
它应该像这样:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 288
谢谢任何人能帮助我。
答案 0 :(得分:0)
此代码有效。不过,可能还有一些更有效的方法。
# Create sample dataframe
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'))
# Break out first and last names
library(stringr)
df$first <- word(df$name,1); df$second <- word(df$name,2);
# Reorder alphabetically
df$a <- ifelse(df$first<df$second, df$first, df$second); df$b <- ifelse(df$first>df$second, df$first, df$second)
library(dplyr)
summarize(group_by(df, a, b), n())
# Yields
# a b `n()`
# <chr> <chr> <int>
#1 darcy elizabeth 2
#2 Doe John 2
#3 Smith Steve 1
答案 1 :(得分:0)
Tks Guys,
我考虑了您的建议,并尝试了类似的方法:
library(dplyr)
#Function to order 2 variables by alphabetical order.
#This function below i got from another post, couldn´t remember the author ;(.
alphabetical <- function(x,y){x < y}
#Created a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Create an empty dataframe
dfCreated <- data.frame(col1=character(),col2=character())
#for each row, I reorder the columns and append to a new dataframe
#Tks to Gregor
for(i in 1:nrow(dfSample)) {
row <- c(as.String(dfSample[i,1]), as.String(dfSample[i,2]))
if(!alphabetical(row[1],row[2])){
row <- c(row[2],row[1])
}
dfCreated<-rbind(dfCreated,c(row[1],row[2]),stringsAsFactors=FALSE)
}
colnames(dfCreated)<-c("col1","col2")
dfCreated
#tks to Monk
summarize(group_by(dfCreated, col1, col2), n())
col1 col2 `n()`
<chr> <chr> <int>
1 darcy elizabeth 4
2 doe john 2