Using R to count the words common to two sentences

Time: 2016-01-30 01:16:00

Tags: r

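My data frame (usr.bind), shown below, contains the columns query1 and query2. I want to count the words that the two queries have in common and store that count in the Score column.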

picture of code

This is what I tried, but every row gets the same score and I don't know why.

usr.bind$Score <- length(intersect(unlist(usr.bind$query1), unlist(usr.bind$query2)))

I also tried

usr.bind$Score <- length(intersect(unlist(strsplit((usr.bind$query1)," ")), unlist(strsplit((usr.bind$query2), " "))))

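but got the error Error in strsplit((usr.bind$query2), " ") : non-character argument

I also tried using as.character, but every row still gets the same score. Can someone tell me what I am doing wrong?

PS: Sorry that the data is in image form.

I am not sure whether this is the right way to add data, but as suggested, here are query1 and query2.

Desired result: the count of words common to query1 and query2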

    > dput(head(usr.bind[1:5]))
structure(list(uid1 = structure(c(3L, 25L, 39L, 50L, 59L, 62L
), .Label = c("A0001", "A0005", "A0008", "A0009", "A0010", "A0011", 
"A0015", "A0018", "A0019", "A0020", "A0021", "A0022", "A0024", 
"A0025", "A0026", "A0029", "A0030", "A0033", "A0034", "A0037", 
"A0039", "A0040", "A0041", "A0042", "A0043", "A0044", "A0046", 
"A0047", "A0048", "A0049", "A0050", "A0052", "A0054", "A0056", 
"A0057", "A0059", "A0061", "A0064", "A0065", "A0066", "A0067", 
"A0069", "A0071", "A0073", "A0074", "A0075", "A0077", "A0080", 
"A0081", "A0082", "A0084", "A0087", "A0088", "B0005", "B0007", 
"B0009", "B0012", "B0013", "B0018", "B0020", "B0025", "B0026", 
"B0027"), class = "factor"), uid2 = structure(c(3L, 3L, 3L, 3L, 
3L, 3L), .Label = c("A0001", "A0005", "A0008", "A0009", "A0010", 
"A0011", "A0015", "A0018", "A0019", "A0020", "A0021", "A0022", 
"A0024", "A0025", "A0026", "A0029", "A0030", "A0033", "A0034", 
"A0037", "A0039", "A0040", "A0041", "A0042", "A0043", "A0044", 
"A0046", "A0047", "A0048", "A0049", "A0050", "A0052", "A0054", 
"A0056", "A0057", "A0059", "A0061", "A0064", "A0065", "A0066", 
"A0067", "A0069", "A0071", "A0073", "A0074", "A0075", "A0077", 
"A0080", "A0081", "A0082", "A0084", "A0087", "A0088", "B0005", 
"B0007", "B0009", "B0012", "B0013", "B0018", "B0020", "B0025", 
"B0026", "B0027"), class = "factor"), query1 = structure(1:6, .Label = c("how to get main method\n new scanner (system.in)\n nextInt()\n do loop\n while-do loop\n what meaning of /n\n nextString\n how to converse case\n how to converse downcase to upcase\n how to converse down case to up case\n how to use euqals to ignoring case\n number format persentage\n use number format to get persentage\n simple\n sample\n JRadioButton\n how to transfer int to color\n how to transfer int to Color\n Color[]\n what method can decide character to operand \n askto method\n ask to method\n", 
"sorting numbers in a array\n", "initialize array list\n", "abstract classes\n subclass\n /n two in a row\n", 
"what is the length method\n how do you know whats private or public\n whats the symbol for private method\n how to create a subclass\n how to create a subclass in java\n how to write a toString\n how to format decimals\n how to use java.text.DecimalFormat\n how to use java.text.DecimalFormat in a string\n", 
"How to call from other class\n How to call methods from other class\n call method from other class\n print method from other class\n call private method from other class\n print private value from other method\n print private value from other class\n parser\n parser java\n array\n read from\n read from java\n read string from java\n parseInteger\n"
), class = "factor"), query2 = structure(c(1L, 1L, 1L, 1L, 1L, 
1L), .Label = c("how to get main method\n new scanner (system.in)\n nextInt()\n do loop\n while-do loop\n what meaning of /n\n nextString\n how to converse case\n how to converse downcase to upcase\n how to converse down case to up case\n how to use euqals to ignoring case\n number format persentage\n use number format to get persentage\n simple\n sample\n JRadioButton\n how to transfer int to color\n how to transfer int to Color\n Color[]\n what method can decide character to operand \n askto method\n ask to method\n", 
"sorting numbers in a array\n", "initialize array list\n", "abstract classes\n subclass\n /n two in a row\n", 
"what is the length method\n how do you know whats private or public\n whats the symbol for private method\n how to create a subclass\n how to create a subclass in java\n how to write a toString\n how to format decimals\n how to use java.text.DecimalFormat\n how to use java.text.DecimalFormat in a string\n", 
"How to call from other class\n How to call methods from other class\n call method from other class\n print method from other class\n call private method from other class\n print private value from other method\n print private value from other class\n parser\n parser java\n array\n read from\n read from java\n read string from java\n parseInteger\n"
), class = "factor"), Score = c(94L, 94L, 94L, 94L, 94L, 94L)), .Names = c("uid1", 
"uid2", "query1", "query2", "Score"), row.names = c(NA, 6L), class = "data.frame")

2 answers:

Answer 0: (score: 0)

Like this:

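count_shared_words <- function(s1, s2){
  # Split on runs of whitespace or punctuation. strsplit() returns a list,
  # so [[1]] extracts the word vector; as.character() guards against factors.
  l1 <- unique(strsplit(as.character(s1), split='[[:space:][:punct:]]+')[[1]])
  l2 <- unique(strsplit(as.character(s2), split='[[:space:][:punct:]]+')[[1]])
  length(intersect(l1, l2))
}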

Then you can mapply() it like this:

df <- data.frame(
  a = c('the falcon caught the flying squirrels',
         'the sunny days are the worst'),
  b = c('flying with squirrels makes me nervous',
       'days that are sunny make me happy'),
  stringsAsFactors = FALSE)

df$shared_count <- mapply(count_shared_words, s1=df$a, s2=df$b)

df
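As a minimal sketch of why the row-wise mapply() call matters: the attempt in the question pools the words of every row, so length() yields a single number that R recycles into each row. The toy data frame and the column names pooled_score and row_score below are hypothetical, chosen only for illustration.

```r
# Hypothetical two-row data frame standing in for usr.bind
toy <- data.frame(
  query1 = c("do loop", "array list"),
  query2 = c("while loop", "array list"),
  stringsAsFactors = FALSE)

# Whole-column version: the words of all rows are pooled, length() returns
# one number, and R recycles that single value into every row
pooled <- length(intersect(unlist(strsplit(toy$query1, " ")),
                           unlist(strsplit(toy$query2, " "))))
toy$pooled_score <- pooled

# Row-wise version: one intersect() per pair of strings
toy$row_score <- mapply(
  function(a, b) length(intersect(strsplit(a, " ")[[1]], strsplit(b, " ")[[1]])),
  toy$query1, toy$query2, USE.NAMES = FALSE)

print(toy)
```

Here pooled_score is 3 in both rows, while row_score differs per row (1 and 2).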

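The non-character argument error probably occurs because you are passing a different data type into the function. Try using as.character() to convert the values into real character strings (rather than factors).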
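The dput() output in the question shows query1 and query2 stored as factors, which fits this explanation. The following sketch uses a hypothetical two-element factor q in place of the real column to show the failure and the effect of the conversion.

```r
# Hypothetical toy column standing in for usr.bind$query1
q <- factor(c("how to get main method\n do loop\n", "initialize array list\n"))

# strsplit() rejects factors; capture the error message instead of stopping
res <- tryCatch(strsplit(q, " "), error = function(e) conditionMessage(e))
print(res)

# After as.character(), strsplit() returns one word vector per row
words <- strsplit(as.character(q), "[[:space:]]+")
print(lengths(words))
```

Splitting on "[[:space:]]+" rather than a single space also handles the "\n " separators that appear inside the query strings.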

Answer 1: (score: 0)

I am not sure I understand your goal, but consider code like this:

countOfSame <- function(s)
{
  merged <- merge(unlist(strsplit(s[1]," ")),unlist(strsplit(s[2]," ")))
  return(sum(apply(merged[!duplicated(merged),],1,function(x) {ifelse(toupper(x[1]) == toupper(x[2]),TRUE,FALSE)})))
}

data <- rbind(c("foo bar","foo jar"),c("foo bar","bar foo"),c("foo foo bar bar","bar"))
cbind(data,apply(data,1,countOfSame))

#result:

     [,1]              [,2]      [,3]
[1,] "foo bar"         "foo jar" "1" 
[2,] "foo bar"         "bar foo" "2" 
[3,] "foo foo bar bar" "bar"     "1" 

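It is not very elegant, but countOfSame takes a vector of two strings and returns the number of words the two strings have in common (case-insensitive). You then simply use apply to run it over the two columns of a matrix or data frame.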
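For comparison, a more compact sketch of the same idea might lower-case the words and use intersect() instead of merge(). The helper name count_same_ci is hypothetical, and unlike countOfSame it also removes duplicates case-insensitively, so the two are not guaranteed to agree on mixed-case input.

```r
# Hypothetical helper: case-insensitive count of distinct shared words
count_same_ci <- function(s) {
  w1 <- unique(tolower(strsplit(s[1], " ", fixed = TRUE)[[1]]))
  w2 <- unique(tolower(strsplit(s[2], " ", fixed = TRUE)[[1]]))
  length(intersect(w1, w2))
}

data <- rbind(c("foo bar", "foo jar"),
              c("foo bar", "bar foo"),
              c("foo foo bar bar", "bar"))

# One count per row of the two-column character matrix
counts <- apply(data, 1, count_same_ci)
print(counts)
```

On this sample matrix the counts are 1, 2 and 1, matching the result shown above.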