Question

I have converted a collection of text messages scraped from a forum into a data frame. Here is a reproducible example:

example.df <- data.frame(author=c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
                         message=c("Hello World! Mikey Mouse", 
                                   "Quack Quack! Donald Duck", 
                                   "I was born in 1928. Mikey Mouse", 
                                   "Quack Quack! Daisy Duck", 
                                   "The quick fox jump over Minnie Mouse", 
                                   "Quack Quack! Daisy Duck"))

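My idea is to find the longest common suffix shared by every message from the same author, for all authors who have written more than one message. For everyone else, I will find a regex approach that degrades gracefully.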

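I found the Bioconductor package RLibstree, which looks promising thanks to its function getLongestCommonSubstring, but I don't know how to apply the function group-wise to all messages from the same author.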

Answer 1

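I think I would convert to a list in the following format, use the stringdist package to find sentences an author repeats, and remove every sentence that is above a similarity threshold across that author's messages. outer may also be of use here: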

## load packages in this order
library(stringi)
library(magrittr)

example.df[["message"]] %>% 
    stringi::stri_split_regex(., "(?<=[.?!]{1,5})\\s+") %>%
    split(example.df[["author"]])

## $Daisy
## $Daisy[[1]]
## [1] "Quack Quack!" "Daisy Duck"  
## 
## $Daisy[[2]]
## [1] "Quack Quack!" "Daisy Duck"  
## 
## 
## $Donald
## $Donald[[1]]
## [1] "Quack Quack!" "Donald Duck" 
## 
## 
## $Mikey
## $Mikey[[1]]
## [1] "Hello World!" "Mikey Mouse" 
## 
## $Mikey[[2]]
## [1] "I was born in 1928." "Mikey Mouse"        
## 
## 
## $Minnie
## $Minnie[[1]]
## [1] "The quick fox jump over Minnie Mouse"
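
One way to read the idea above: treat a sentence as a signature when a near-identical sentence shows up in every other message by the same author. The following is only a minimal sketch under that assumption. It swaps in base R's adist (Levenshtein distance) for stringdist so that it runs without extra packages; sentence_sim, strip_repeated, the 0.9 threshold, and the hard-coded input list are all hypothetical and exist only for illustration.

```r
# Hypothetical input: sentences per message, grouped by author,
# in the same shape as the list printed above
sentences <- list(
  Mikey  = list(c("Hello World!", "Mikey Mouse"),
                c("I was born in 1928.", "Mikey Mouse")),
  Donald = list(c("Quack Quack!", "Donald Duck"))
)

# Similarity in [0, 1]: 1 minus the Levenshtein distance scaled by the
# longer string (assumption: base adist stands in for stringdist here)
sentence_sim <- function(a, b) {
  d <- utils::adist(a, b)
  longest <- pmax(outer(nchar(a), nchar(b), pmax), 1)
  1 - d / longest
}

# Drop a sentence when every other message of the same author contains
# a sentence at least `threshold` similar to it
strip_repeated <- function(msgs, threshold = 0.9) {
  if (length(msgs) < 2) return(msgs)  # nothing to compare against
  lapply(seq_along(msgs), function(i) {
    cur <- msgs[[i]]
    repeated <- vapply(cur, function(s) {
      all(vapply(msgs[-i],
                 function(other) any(sentence_sim(s, other) >= threshold),
                 logical(1)))
    }, logical(1))
    cur[!repeated]
  })
}

cleaned <- lapply(sentences, strip_repeated)
```

With this input, cleaned$Mikey keeps only "Hello World!" and "I was born in 1928.", while Donald's single message is left untouched because there is nothing to compare it with.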

Answer 2

"I don't know how to apply the function group-wise to all messages from the same author."

Perhaps tapply is what you are looking for.

> tapply(as.character(example.df$message), example.df$author, function(x) x)
$Daisy
[1] "Quack Quack! Daisy Duck" "Quack Quack! Daisy Duck"

$Donald
[1] "Quack Quack! Donald Duck"

$Mikey
[1] "Hello World! Mikey Mouse"        "I was born in 1928. Mikey Mouse"

$Minnie
[1] "The quick fox jump over Minnie Mouse"

Of course, you can use your own function in place of function(x) x.
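
As an illustration of plugging in your own function, here is a minimal sketch that passes a longest-common-suffix helper to tapply. The helper common_suffix is hypothetical, written in base R for this example; it is not the getLongestCommonSubstring function from the question, and the shortened input vectors are assumptions as well.

```r
# Hypothetical, shortened version of the example data
msgs   <- c("Hello World! Mikey Mouse",
            "Quack Quack! Donald Duck",
            "I was born in 1928. Mikey Mouse")
author <- c("Mikey", "Donald", "Mikey")

# Hypothetical helper: longest suffix shared by all strings in x
# (returns "" when there is only one string to look at)
common_suffix <- function(x) {
  if (length(x) < 2) return("")
  # reverse each string's characters so suffixes start at index 1
  rev_chars <- lapply(strsplit(x, ""), rev)
  n <- min(lengths(rev_chars))
  k <- 0
  # extend the suffix while all strings agree on the next character
  while (k < n &&
         length(unique(vapply(rev_chars, `[`, character(1), k + 1))) == 1) {
    k <- k + 1
  }
  substring(x[1], nchar(x[1]) - k + 1)
}

# one suffix per author
suffixes <- tapply(msgs, author, common_suffix)
```

Here suffixes[["Mikey"]] is " Mikey Mouse" (including the leading space) and suffixes[["Donald"]] is the empty string.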

Answer 3

Here is an implementation that uses no additional libraries.

example.df <- data.frame(author=c("Mikey", "Donald", "Mikey",
                                  "Daisy", "Minnie", "Daisy"),
                         message=c("Hello World! Mikey Mouse", 
                                   "Quack Quack! Donald Duck", 
                                   "I was born in 1928. Mikey Mouse", 
                                   "Quack Quack! Daisy Duck", 
                                   "The quick fox jump over Minnie Mouse", 
                                   "Quack Quack! Daisy Duck"))

signlen = function(am)  # determine signature length of an author's messages
{
    if (length(am) <= 1) return(0)  # return if not more than 1 message

    # turn the messages into reversed vectors of single characters
    # in order to conveniently access the suffixes from index 1 on
    am = lapply(strsplit(as.character(am), ''), rev)
    # find the longest common suffix in the messages
    longest_common = .Machine$integer.max
    for (m in 2:length(am))
    {
        i = 1
        max_length = min(length(am[[m]]), length(am[[m-1]]), longest_common)
        while (i <= max_length && am[[m]][i] == am[[m-1]][i]) i = i+1
        longest_common = i-1
        if (longest_common == 0) return(0)  # shortcut: need not look further
    }
    return(longest_common)
}

# determine signature length of every author's messages
signature_length = tapply(example.df$message, example.df$author, signlen)
#> signature_length
# Daisy Donald  Mikey Minnie 
#    23      0     12      0 

# determine resulting length "to" of messages with signatures removed
to = nchar(as.character(example.df$message))-signature_length[example.df$author]
#> to
# Mikey Donald  Mikey  Daisy Minnie  Daisy 
#    12     24     19      0     36      0 

# remove the signatures by replacing messages with resulting substring
example.df$message = substr(example.df$message, 1, to)
#> example.df
#  author                              message
#1  Mikey                         Hello World!
#2 Donald             Quack Quack! Donald Duck
#3  Mikey                  I was born in 1928.
#4  Daisy                                     
#5 Minnie The quick fox jump over Minnie Mouse
#6  Daisy

Detecting and removing signatures from forum messages using R
