Question

I would like to speed up the code below. Could someone offer some suggestions?

currencies.ForEach(c => 
    Console.WriteLine($"drzava_iso: {c.Drzava_iso}, Sifra_valute: {c.Sifra_valute}, ...")
);

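I have tried proxy::dist and various other approaches, none of which worked. I also suspect that the string distance function does not behave as expected, but that is a separate issue.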

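Ultimately, I want to use the distance matrix to perform some clustering, so that similar strings are grouped together (regardless of word order).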

Answer 1

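If you need a matrix, you can use the stringdist package. As far as I know, the package you were using relies on Levenshtein distance, so I included method = "lv" (you can try other methods as well). Let me know if you have any questions, or if you would prefer a format other than a matrix. You may also want to consider a method other than Levenshtein distance, since an absolute edit count treats two changes in a four-letter word the same as two changes in a 20-word sentence. Good luck!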
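To make the length caveat concrete, here is a minimal sketch. The example strings and the normalisation (dividing by the length of the longer string) are assumptions chosen for illustration; they are not part of the original answer.

```r
library(stringdist)

# Two substitutions in a four-letter word...
short_a <- "word"
short_b <- "wxyd"

# ...and two substitutions in a much longer sentence.
long_a <- "the quick brown fox jumps over the lazy dog"
long_b <- "the quick brown fax jumps over the lazy dig"

a <- c(short_a, long_a)
b <- c(short_b, long_b)

# Raw Levenshtein distance, computed element-wise
raw <- stringdist(a, b, method = "lv")

# Assumed normalisation: divide by the length of the longer string in each pair
rel <- raw / pmax(nchar(a), nchar(b))

data.frame(a = a, b = b, raw = raw, relative = rel)
```

Both pairs have the same raw distance, while the relative distance is much smaller for the long sentence. stringdist also provides methods such as "jw" that return values between 0 and 1, which may be worth trying.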

library(stringdist)

# Example data; stringsAsFactors = FALSE keeps the column as character
data <- data.frame(
  string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances between all strings
dist_mat <- stringdist::stringdistmatrix(data$string, data$string, method = "lv")

# Label rows and columns with the original strings
rownames(dist_mat) <- data$string
colnames(dist_mat) <- data$string

dist_mat
                        hello world hello vorld hello world 1 hello world hello world hello world
hello world                       0           1             2           0                      12
hello vorld                       1           0             3           1                      13
hello world 1                     2           3             0           2                      11
hello world                       0           1             2           0                      12
hello world hello world          12          13            11          12                       0
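For the clustering goal stated in the question, one possible continuation is sketched below. It assumes that "regardless of word order" can be approximated by sorting the words in each string before computing distances. The helper `sort_words`, the example strings, the linkage method, and the choice of `k = 2` are all hypothetical and only for illustration.

```r
library(stringdist)

strings <- c("hello world", "hello vorld", "hello world 1",
             "world hello", "hello world hello world")

# Hypothetical helper: sort the words inside each string so that
# "world hello" and "hello world" become identical.
sort_words <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(words) paste(sort(words), collapse = " "),
         character(1))
}

normalised <- sort_words(strings)

# With a single argument, stringdistmatrix() returns a 'dist' object,
# which hclust() accepts directly.
d <- stringdistmatrix(normalised, method = "lv")

# Hierarchical clustering; the linkage and k are arbitrary choices here
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

data.frame(string = strings, group = groups)
```

A q-gram based method such as method = "cosine" or method = "jaccard" (with the q argument) is another option to try, since q-gram profiles are less sensitive to where words appear in the string.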

Creating a distance matrix of strings
