List 1: "crashed", "red", "car"
List 2: "crashed", "blue", "bus"

I want to find the operations needed to transform "list1" into "list2". As you can see, I only need two operations: 1. Replace "red" with "blue". 2. Replace "car" with "bus".

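But how can we find the number of such operations automatically? There are several operations we can use to transform the sentence: adding, deleting or replacing words in the list. I will now try to explain how the algorithm works: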


Rows: i = 0, 1, 2, 3;    columns: j = 0, 1, 2, 3

(example: value[0,0] = 0 , value[0, 1] = 1 ...)

                 crashed    red     car
         0          1        2       3

crashed  1
blue     2
bus      3
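As a rough sketch in R (the language used in the answers below), the starting table could be set up as follows. The names list_1, list_2 and value mirror the question, but the code is only an illustration and not the asker's own. R indexes from 1, so value[1, 1] here plays the role of value[0,0] in the text.

```r
# Assumed inputs, taken from the example lists in the question
list_1 <- c("crashed", "red", "car")
list_2 <- c("crashed", "blue", "bus")

# One extra row and column for the "empty list" border
value <- matrix(NA_integer_,
                nrow = length(list_2) + 1,
                ncol = length(list_1) + 1,
                dimnames = list(c("", list_2), c("", list_1)))

# Border: turning an empty list into the first k words costs k additions
value[1, ] <- 0:length(list_1)
value[, 1] <- 0:length(list_2)

print(value)
```

The cells that are still NA are the ones the rest of the walkthrough fills in.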

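Now I will try to fill in the table. Note that each cell in the table shows the number of operations (add, delete or replace) needed to transform the sentence. Consider the interaction between "crashed" and "crashed" (value[1,1]): obviously we do not need to change anything, so the value will be '0', because they are the same word. Basically, we take the diagonal value = value[0,0].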

                 crashed    red     car
         0          1        2       3

crashed  1          0
blue     2
bus      3


For "crashed" and "red" (value[1,2]) the words differ, so we take:

min{value[0,1] , value[0,2] and value[1,1]} + 1 
min{ 1, 2, 0} + 1 = 1 

Therefore, we only need to remove "red". The table will then look like this:

                 crashed    red     car
         0          1        2       3

crashed  1          0        1
blue     2  
bus      3

We continue like this. "crashed" and "car" will be:

min{value[0,3], value[0,2] and value[1,2]} + 1 
min{3, 2, 1} +1 = 2


                 crashed    red     car
         0          1        2       3

crashed  1          0        1       2
blue     2  
bus      3


             crashed    red     car
         0      1        2       3

crashed  1      0        1       2
blue     2      1        1       2
bus      3      2        2       2 

You can see that the last number in the table shows the distance between the two sentences: value[3,3] = 2


 if (characters_in_header_of_matrix[j] == characters_in_column_of_matrix[i])

 then { value[i,j] = value[i-1, j-1] }   #get the 'DIAGONAL VALUE'

 else { value[i,j] = min(value[i-1, j], value[i-1, j-1],  value[i, j-1]) + 1 }
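The rule above is the standard word-level edit distance (Levenshtein) recurrence. A minimal sketch of it in R might look like the following; the function name word_edit_distance is hypothetical, and the rows and columns are shifted by one because R indexes from 1.

```r
# Hypothetical helper: fills the whole table for two character vectors.
# Rows correspond to `rows`, columns to `cols`, plus one border row/column.
word_edit_distance <- function(rows, cols) {
  n <- length(rows)
  m <- length(cols)
  value <- matrix(0L, nrow = n + 1, ncol = m + 1,
                  dimnames = list(c("", rows), c("", cols)))
  value[, 1] <- 0:n
  value[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (rows[i] == cols[j]) 0L else 1L
      value[i + 1, j + 1] <- min(value[i, j + 1] + 1L,  # cell above
                                 value[i + 1, j] + 1L,  # cell to the left
                                 value[i, j] + cost)    # diagonal cell
    }
  }
  value
}

tab <- word_edit_distance(c("crashed", "blue", "bus"),
                          c("crashed", "red", "car"))
print(tab)
```

Using the diagonal value plus a 0/1 cost is equivalent to the if/else form above: when the two words match, the diagonal value is never larger than the other two candidates.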

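To find the differences between the elements of the two lists, which you can see in the header and the first column of the matrix, I used the strcmp() function, which gives a boolean value (TRUE or FALSE) when comparing words. However, I failed to implement this. I would appreciate your help, thanks.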
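Since the goal is the list of operations and not only their number, one possible approach is to fill the table and then walk back from the last cell to the first. The sketch below assumes the standard edit distance recurrence; edit_operations is a hypothetical name, and when several minimal edit scripts exist it simply returns one of them.

```r
# Hypothetical helper: returns one minimal sequence of word operations
edit_operations <- function(from, to) {
  n <- length(from)
  m <- length(to)
  value <- matrix(0L, n + 1, m + 1)
  value[, 1] <- 0:n
  value[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(from[i] != to[j])
      value[i + 1, j + 1] <- min(value[i, j + 1] + 1L,
                                 value[i + 1, j] + 1L,
                                 value[i, j] + cost)
    }
  }
  # Walk back from the bottom-right cell, recording what was done
  ops <- character(0)
  i <- n
  j <- m
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        value[i + 1, j + 1] == value[i, j] + as.integer(from[i] != to[j])) {
      if (from[i] != to[j])
        ops <- c(sprintf('replace "%s" with "%s"', from[i], to[j]), ops)
      i <- i - 1
      j <- j - 1
    } else if (i > 0 && value[i + 1, j + 1] == value[i, j + 1] + 1L) {
      ops <- c(sprintf('delete "%s"', from[i]), ops)
      i <- i - 1
    } else {
      ops <- c(sprintf('add "%s"', to[j]), ops)
      j <- j - 1
    }
  }
  ops
}

ops <- edit_operations(c("crashed", "red", "car"),
                       c("crashed", "blue", "bus"))
print(ops)
```

For the example lists this yields the two replacements named at the top of the question.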

2 answers:

Answer 0: (score: 2)




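  1. Initialize string_1 and string_2 with lengths n_1 and n_2
  2. Calculate the cumulative difference between the first min(n_1, n_2) elements
  3. Use this cumulative difference as the diagonal of the matrix
  4. Set the first off-diagonal element to the very first element + 1
  5. Calculate the remaining off-diagonal elements as: diag(i) - diag(i-1) + full_matrix(i-1,j)
  6. In the previous step, i iterates over the diagonal and j iterates over the rows/columns (either one works); we start at the third diagonal element, as the first 2x2 matrix is filled in steps 1 to 4
  7. Calculate the remaining abs(n_1 - n_2) elements as full_matrix[,min(n_1, n_2)] + 1:abs(n_1 - n_2), applying the latter to each value of the former, and bind them appropriately to full_matrix.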
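As an illustration only, steps 2 to 6 can be traced on the example words with a few lines of base R. The variable names x_words, y_words and n_min are assumptions made for this sketch, and it presumes both lists contain at least two words.

```r
x_words <- c("crashed", "red", "car")
y_words <- c("crashed", "blue", "bus")
n_min <- min(length(x_words), length(y_words))
idx <- seq_len(n_min)

# Steps 2-3: cumulative number of differing positions on the diagonal
full_matrix <- diag(cumsum(x_words[idx] != y_words[idx]))

# Step 4: first off-diagonal element
full_matrix[1, 2] <- full_matrix[2, 1] <- full_matrix[1, 1] + 1

# Steps 5-6: remaining off-diagonal elements, mirrored across the diagonal
if (n_min > 2) {
  for (i in 3:n_min) {
    for (j in seq_len(i - 1)) {
      full_matrix[i, j] <- full_matrix[j, i] <-
        full_matrix[i, i] - full_matrix[i - 1, i - 1] + full_matrix[i - 1, j]
    }
  }
}

print(full_matrix)
```

This only follows the listed steps literally for equal-length inputs; it is not a check that the procedure equals the edit distance in general.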




library(stringr) #provides str_split()

Dist_between_strings <- function(x, y, 
                                 split = " ", 
                                 split_x = split, split_y = split, 
                                 case_sensitive = TRUE){
  #Safety checks
  if(!is.character(x) || !is.character(y) || 
     nchar(x) == 0 || nchar(y) == 0)
    stop("x, y needs to be non-empty character strings.")
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.logical(case_sensitive))
    stop("case_sensitivity needs to be logical")
  #Extract variable names of our variables
  # used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  #Expression which when evaluated will name our output
  dimname_expression <- 
    parse(text = paste0("dimnames(output) <- list(",x_name," = x_names,",
                        y_name," = y_names)"))
  #split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  #are we case_sensitive? If not, compare the lower-cased words
  if(isTRUE(case_sensitive)){
    x_split <- x_names
    y_split <- y_names
  }else{
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }
  #Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  #If we have one string that has length 1, the output is simplified
  if(n_min == 1){ 
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  #If not we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  #The loop will fill in the off_diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1 
  if(n_min > 2) #only loop when there is a third diagonal element
    for(i in 3:n_min){
      for(j in 1:(i - 1)){
        output[i,j] <- output[j,i] <- output[i,i] - output[i - 1, i - 1] + #are the words different?
          output[i - 1, j] #How many words were different before?
      }
    }
  #comparison if the list is not of the same size
  if(nx != ny){
    #Add the remaining words to the side that does not contain this
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[,n_min])
    #merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  #set the dimension names, 
  # I would like the original variable names to be displayed, as such i create an expression and evaluate it
  eval(dimname_expression)
  output
}


list_1 <- list("crashed","red","car")
list_2 <- list("crashed","blue","bus")
string_1 <- paste(list_1,collapse = " ")
string_2 <- paste(list_2,collapse = " ")
Dist_between_strings(string_1, string_2)

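#Strings in the given example
         string_2
string_1  crashed blue bus
  crashed       0    1   2
  red           1    1   2
  car           2    2   2

This is not exactly the requested output, but it yields the same information, as the words are ordered as they were given in the strings.

More examples

I stated that it also works for other strings, and that is indeed the case, so let us try some user-defined strings:

#More complicated strings
string_3 <- "I am not a blue whale"
string_4 <- "I am a cat"
string_5 <- "I am a beautiful flower power girl with monster wings"
string_6 <- "Hello"
Dist_between_strings(string_3, string_4, case_sensitive = TRUE)
Dist_between_strings(string_3, string_5, case_sensitive = TRUE)
Dist_between_strings(string_4, string_5, case_sensitive = TRUE)
Dist_between_strings(string_6, string_5)

Now that the implementation has been accepted as correct, we would like to know how well it performs (the uninterested reader can scroll past this section to where a faster implementation is given). For this purpose I will use much larger strings. For a complete benchmark I should test various string sizes, but here I will only use 2 rather large strings of size 1000 and 2500. I use the microbenchmark package in R, which contains a microbenchmark function that claims to be accurate down to the nanosecond. The function executes the code 100 times (or a user-defined number of times) and returns the mean and quartiles of the run times. Because of other parts of R, such as the garbage collector, the median is usually considered a good estimate of the function's actual average run time. The execution and results are shown below: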



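#Benchmarks for larger strings
set.seed(1)
string_7 <- paste(sample(LETTERS,1000,replace = TRUE), collapse = " ")
string_8 <- paste(sample(LETTERS,2500,replace = TRUE), collapse = " ")
microbenchmark::microbenchmark(String_Comparison = Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
# expr                   min       lq     mean   median       uq      max neval
# String_Comparison 716.5703 729.4458 816.1161 763.5452 888.1231 1106.959   100

Now, I find the run time very slow. One use case for the implementation could be an initial check of student hand-ins for plagiarism, in which case a low difference count very likely indicates plagiarism. These can be very long and there may be hundreds of hand-ins, so I would like the run to be very fast. To figure out how to improve my implementation I used the profvis package and the corresponding profvis function. To profile the function I exported it to another R script, which I sourced, running the code once before profiling to compile it and avoid profiling noise (important). The code to run the profiling can be seen below, and the most important part of the output is visualized in the image that follows.

library(profvis)
profvis(Dist_between_strings(string_7, string_8, case_sensitive = FALSE))

[Figure: Profiling of the string differences]

Now, despite the colour, I can see a clear problem here. The loop filling the off-diagonal is responsible for by far the largest part of the run time. Loops in R (as in Python and other non-compiled languages) are notoriously slow.

To improve the implementation, we can implement the loop in C++ using the Rcpp package. This is rather simple. The code is not unlike what we would write in R, as long as we avoid iterators. A C++ script can be created via File -> New File -> C++ File. The following C++ code is pasted into that file and sourced using the Source button.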

//Rcpp Code
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix Cpp_String_difference_outer_diag(NumericMatrix output){
  long nrow = output.nrow();
  for(long i = 2; i < nrow; i++){ // note the 0-based index: i = 2 is the third row
    for(long j = 0; j < i; j++){
      output(i, j) = output(i, i) - output(i - 1, i - 1) + //are the words different?
                                  output(i - 1, j);
      output(j, i) = output(i, j);
    }
  }
  return output;
}

The corresponding R function needs to be changed to use this function instead of looping. The code is similar to the first function, only switching the loop for a call to the C++ function.

Dist_between_strings_cpp <- function(x, y, 
                                     split = " ", 
                                     split_x = split, split_y = split, 
                                     case_sensitive = TRUE){
  #Safety checks
  if(!is.character(x) || !is.character(y) || 
     nchar(x) == 0 || nchar(y) == 0)
    stop("x, y needs to be non-empty character strings.")
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.logical(case_sensitive))
    stop("case_sensitivity needs to be logical")
  #Extract variable names of our variables
  # used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  #Expression which when evaluated will name our output
  dimname_expression <- 
    parse(text = paste0("dimnames(output) <- list(",x_name," = x_names,",
                        y_name," = y_names)"))
  #split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  #are we case_sensitive? If not, compare the lower-cased words
  if(isTRUE(case_sensitive)){
    x_split <- x_names
    y_split <- y_names
  }else{
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }
  #Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  #If we have one string that has length 1, the output is simplified
  if(n_min == 1){ 
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  #If not we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  #The loop will fill in the off_diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1 
  if(n_max > 2) 
    output <- Cpp_String_difference_outer_diag(output) #Execute the c++ code
  #comparison if the list is not of the same size
  if(nx != ny){
    #Add the remaining words to the side that does not contain this
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[,n_min])
    #merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  #set the dimension names, 
  # I would like the original variable names to be displayed, as such i create an expression and evaluate it
  eval(dimname_expression)
  output
}

Testing the C++ implementation

To be sure the implementation is correct, we check that the C++ implementation yields the same output.

#Test the cpp implementation
identical(Dist_between_strings(string_3, string_4, case_sensitive = TRUE),
          Dist_between_strings_cpp(string_3, string_4, case_sensitive = TRUE))

Now, is this actually faster? To see this we can run another benchmark using the microbenchmark package.


Answer 1: (score: -1)



s1 <- c("crashed", "red", "car")
s2 <- c("crashed", "blue", "bus")

ll <- list(s1, s2)

alnum <- c(letters, LETTERS, 0:9)

ll2 <- relist(alnum[factor(unlist(ll))], ll)

ll2 <- sapply(ll2, paste, collapse="")

adist(ll2) # generalized Levenshtein distance between the encoded strings

#      [,1] [,2]
# [1,]    0    2
# [2,]    2    0

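As far as I can tell, the main limitation here is the number of unique characters available, which is 62 in this case, but it can be extended quite easily depending on your locale. For example: intToUtf8(c(32:126, 161:300), TRUE)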
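A minimal sketch of that extension, wrapped in a function, could look like this. The name word_adist is hypothetical, and the sketch assumes a UTF-8 locale so that each code point maps to a single character.

```r
# Hypothetical helper: encode every distinct word as one character,
# then let adist() compute the edit distance between the encoded strings.
word_adist <- function(..., counts = FALSE) {
  word_lists <- list(...)
  symbols <- intToUtf8(c(32:126, 161:300), multiple = TRUE)
  codes <- as.integer(factor(unlist(word_lists)))
  if (max(codes) > length(symbols))
    stop("More distinct words than available symbols.")
  encoded <- relist(symbols[codes], word_lists)
  encoded <- vapply(encoded, paste, character(1), collapse = "")
  utils::adist(encoded, counts = counts)
}

d <- word_adist(c("crashed", "red", "car"),
                c("crashed", "blue", "bus"))
print(d)

# With counts = TRUE, adist() also records the transformation sequences
# (M = match, I = insertion, D = deletion, S = substitution)
attr(word_adist(c("crashed", "red", "car"),
                c("crashed", "blue", "bus"), counts = TRUE), "trafos")
```

The "trafos" attribute is one way to get at the operations themselves, which is what the question originally asked for.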